cs336-03-Performance Profiling
gogongxt test report
As an example, the log below profiles the 2.7B model at a context length of 512, testing forward, forward+backward, and train in turn.
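The summary tables in this log report every duration in nanoseconds with thousands separators, which is hard to compare by eye. As a reading aid, here is a minimal sketch of a helper that splits one `nvtx_sum` data row into typed fields and converts it to milliseconds. The function names `parse_nvtx_row` and `ns_to_ms` are hypothetical and are not part of `nsys`; the column order is taken from the table header in the log.

```python
# Column order of the nvtx_sum report, taken from its header line in the log.
NVTX_COLUMNS = (
    "time_pct", "total_ns", "instances", "avg_ns", "med_ns",
    "min_ns", "max_ns", "stddev_ns", "style", "range",
)


def parse_nvtx_row(line):
    """Split one nvtx_sum data row into a dict of typed fields (hypothetical helper)."""
    # The first nine fields contain no spaces; the range name may contain
    # spaces (for example ":rope encoding"), so it takes the remainder.
    parts = line.split(None, 9)
    if len(parts) != 10:
        raise ValueError(f"not an nvtx_sum data row: {line!r}")
    row = dict(zip(NVTX_COLUMNS, parts))
    # Numeric columns use "," as a thousands separator.
    for key in NVTX_COLUMNS[:8]:
        row[key] = float(row[key].replace(",", ""))
    row["instances"] = int(row["instances"])
    # Range names are printed with a leading ":" in the report.
    row["range"] = row["range"].lstrip(":")
    return row


def ns_to_ms(ns):
    """Convert nanoseconds to milliseconds."""
    return ns / 1e6


# Two rows copied verbatim from the forward run in the log.
forward_row = parse_nvtx_row(
    "6.6 844,640,598 1 844,640,598.0 844,640,598.0 844,640,598 844,640,598 0.0 PushPop :forward_pass"
)
rope_row = parse_nvtx_row(
    "3.0 386,992,835 128 3,023,381.5 2,160,590.0 1,661,082 92,582,272 8,013,505.7 PushPop :rope encoding"
)
print(forward_row["range"], ns_to_ms(forward_row["total_ns"]), "ms")
print(rope_row["range"], ns_to_ms(rope_row["med_ns"]), "ms (median)")
```

The same splitting approach should work for the other summary tables whose last column is a name, provided the column tuple is adjusted to each table's header.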
==========================================
Model: 2.7B, Context Length: 512
==========================================
Profiling: size=2.7B, context_length=512, type=forward
Collecting data...
Using device: cuda
Model configuration:
Size: 2.7B
d_model: 2560
d_ff: 10240
num_layers: 32
num_heads: 32
vocab_size: 10000
context_length: 512
batch_size: 4
Initializing model...
Creating random batch...
Running profiled step 1/1...
Profiling complete.
Generating '/tmp/nsys-report-eba2.qdstrm'
[1/8] [========================100%] 2.7B_ctx512_forward.nsys-rep
Processing 259647 events:
[2/8] [========================100%] 2.7B_ctx512_forward.sqlite
[3/8] Executing 'nvtx_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- --------------- --------------- ------------- ------------- ------------- ------- ---------------------------------------
27.4 3,509,732,403 1 3,509,732,403.0 3,509,732,403.0 3,509,732,403 3,509,732,403 0.0 PushPop :warmup
15.3 1,951,345,647 128 15,244,887.9 8,557,875.0 4,580,127 867,075,963 75,970,548.0 PushPop :attention_sublayer
7.3 932,033,009 4 233,008,252.3 17,638,398.5 9,106,138 887,650,074 436,477,614.6 PushPop :layer_0
6.6 844,667,951 1 844,667,951.0 844,667,951.0 844,667,951 844,667,951 0.0 PushPop :profiled_step
6.6 844,640,598 1 844,640,598.0 844,640,598.0 844,640,598 844,640,598 0.0 PushPop :forward_pass
5.1 652,927,547 128 5,100,996.5 3,539,440.0 1,209,669 20,497,491 5,289,676.2 PushPop :ffn_sublayer
4.2 532,425,153 128 4,159,571.5 2,589,074.5 1,357,137 132,770,103 11,622,481.6 PushPop :scaled_dot_product_attention
4.2 531,599,596 128 4,153,121.8 2,584,390.0 1,353,269 132,742,540 11,620,311.7 PushPop :annotated_scaled_dot_product_attention
3.0 386,992,835 128 3,023,381.5 2,160,590.0 1,661,082 92,582,272 8,013,505.7 PushPop :rope encoding
2.3 297,347,995 128 2,323,031.2 925,472.0 342,711 190,186,456 16,748,871.4 PushPop :attention projections
1.8 229,295,364 128 1,791,370.0 1,137,622.5 382,199 58,412,332 5,163,410.8 PushPop :computing softmax
1.3 161,988,135 128 1,265,532.3 825,050.5 336,479 24,145,062 2,225,677.1 PushPop :computing attention scores
0.7 93,249,368 128 728,510.7 258,756.5 187,597 38,442,354 3,456,221.0 PushPop :causal mask construction
0.7 89,274,806 4 22,318,701.5 23,529,256.5 16,101,895 26,114,398 4,805,676.2 PushPop :layer_29
0.7 88,709,681 4 22,177,420.3 24,795,781.5 13,006,635 26,111,483 6,204,590.2 PushPop :layer_28
0.7 85,584,381 4 21,396,095.3 23,124,333.5 12,993,029 26,342,685 6,290,173.3 PushPop :layer_30
0.7 83,281,904 4 20,820,476.0 22,308,517.0 12,331,664 26,333,206 6,704,505.7 PushPop :layer_27
0.6 79,761,791 4 19,940,447.8 20,496,168.0 12,646,931 26,122,524 7,181,660.7 PushPop :layer_24
0.6 79,581,900 4 19,895,475.0 20,983,472.5 11,494,575 26,120,380 7,366,885.9 PushPop :layer_31
0.6 76,702,512 4 19,175,628.0 19,460,410.5 11,657,183 26,124,508 8,036,266.3 PushPop :layer_26
0.6 76,223,576 4 19,055,894.0 19,060,257.5 12,022,599 26,080,462 8,111,200.6 PushPop :layer_25
0.5 62,961,418 4 15,740,354.5 15,874,480.0 13,117,802 18,094,656 2,471,137.6 PushPop :layer_23
0.4 57,511,056 128 449,305.1 380,003.5 336,599 2,851,142 244,297.6 PushPop :final matmul
0.4 51,890,826 4 12,972,706.5 12,655,352.5 5,844,914 20,735,207 8,179,394.2 PushPop :layer_16
0.4 50,645,806 4 12,661,451.5 604,773.5 456,133 48,980,126 24,212,577.8 PushPop :token_embeddings
0.4 50,584,668 4 12,646,167.0 8,811,446.5 5,912,356 27,049,419 9,955,475.3 PushPop :layer_3
0.4 49,668,133 4 12,417,033.3 10,926,908.0 5,941,683 21,872,634 7,861,423.4 PushPop :layer_15
0.4 49,584,559 4 12,396,139.8 11,964,785.0 5,999,837 19,655,152 7,368,607.1 PushPop :layer_1
0.4 46,970,531 4 11,742,632.8 11,442,507.5 5,916,804 18,168,712 6,742,340.5 PushPop :layer_5
0.4 46,459,330 4 11,614,832.5 11,104,466.0 5,851,025 18,399,373 6,605,588.5 PushPop :layer_18
0.4 45,400,676 4 11,350,169.0 10,986,790.0 5,820,215 17,606,881 6,364,266.3 PushPop :layer_17
0.4 45,390,007 4 11,347,501.8 10,124,156.5 6,058,071 19,083,623 6,395,641.8 PushPop :layer_9
0.4 45,289,429 4 11,322,357.3 10,261,495.5 5,941,001 18,825,437 6,402,332.2 PushPop :layer_11
0.3 42,348,161 4 10,587,040.3 10,304,710.0 5,959,809 15,778,932 5,311,080.9 PushPop :layer_20
0.3 42,093,984 4 10,523,496.0 10,286,389.5 5,883,048 15,638,157 5,220,993.6 PushPop :layer_21
0.3 41,921,005 4 10,480,251.3 9,575,562.5 5,981,601 16,788,279 5,375,139.7 PushPop :layer_10
0.3 41,690,223 4 10,422,555.8 10,410,256.5 5,912,977 14,956,733 5,187,652.1 PushPop :layer_6
0.3 40,567,013 4 10,141,753.3 9,421,661.0 5,993,133 15,730,558 4,846,130.0 PushPop :layer_7
0.3 40,392,251 4 10,098,062.8 9,329,130.0 5,898,528 15,835,463 4,945,688.4 PushPop :layer_2
0.3 39,925,348 4 9,981,337.0 8,979,661.0 5,895,222 16,070,804 4,944,447.8 PushPop :layer_4
0.3 39,694,847 4 9,923,711.8 9,852,875.0 5,869,793 14,119,304 4,605,114.7 PushPop :layer_19
0.3 39,655,318 4 9,913,829.5 9,747,448.5 5,896,935 14,263,486 4,581,969.7 PushPop :layer_14
0.3 39,558,050 4 9,889,512.5 9,439,206.0 5,856,877 14,822,761 4,638,748.7 PushPop :layer_12
0.3 39,198,423 4 9,799,605.8 9,470,652.0 6,008,974 14,248,145 4,402,646.5 PushPop :layer_8
0.3 38,472,094 4 9,618,023.5 9,104,621.0 5,950,370 14,312,482 4,282,023.8 PushPop :layer_22
0.3 37,093,394 4 9,273,348.5 8,965,593.5 6,000,618 13,161,589 3,763,297.4 PushPop :layer_13
0.2 22,541,470 128 176,105.2 162,453.0 146,497 425,122 36,386.6 PushPop :output projection
0.1 9,047,845 4 2,261,961.3 1,480,433.5 841,087 5,245,891 2,044,213.0 PushPop :lm_head
0.1 8,956,936 4 2,239,234.0 2,242,350.0 516,411 3,955,825 1,967,245.3 PushPop :final_norm
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------ ------------- --------- ----------- ------------ ----------------------
74.9 10,727,933,987 118 90,914,694.8 100,118,737.0 4,048 214,439,550 31,076,672.9 poll
13.6 1,941,101,621 4,750 408,653.0 111,448.0 1,002 10,931,032 536,883.6 read
7.5 1,067,712,018 2,197 485,986.4 466,703.0 1,843 14,546,821 425,915.5 open64
3.0 431,916,401 4,551 94,905.8 49,647.0 1,112 20,974,114 412,486.3 ioctl
0.3 40,463,016 17,504 2,311.6 2,084.0 1,793 614,263 4,876.9 mmap64
0.2 24,501,776 113 216,829.9 1,924.0 1,172 12,371,822 1,573,013.9 fopen
0.1 21,419,709 938 22,835.5 4,238.0 2,274 14,808,663 483,320.9 socket
0.1 19,886,714 96 207,153.3 111,298.0 77,010 5,230,030 577,880.0 pthread_create
0.1 13,123,209 4,584 2,862.8 2,289.5 1,001 53,244 3,143.2 munmap
0.1 10,120,083 2 5,060,041.5 5,060,041.5 5,059,546 5,060,537 700.7 nanosleep
0.1 8,890,026 65 136,769.6 57,092.0 48,024 5,064,055 620,790.1 sleep
0.0 5,969,677 2,585 2,309.4 1,663.0 1,001 208,719 5,365.4 write
0.0 5,784,467 862 6,710.5 4,408.0 1,753 89,365 6,193.8 mmap
0.0 1,896,913 3 632,304.3 852,759.0 4,389 1,039,765 551,770.8 fread
0.0 1,363,358 91 14,982.0 15,731.0 1,262 59,928 10,432.9 fgets
0.0 1,169,250 8 146,156.3 42,858.5 6,964 520,489 204,776.0 fopen64
0.0 528,625 11 48,056.8 33,536.0 24,277 165,734 40,096.1 sem_timedwait
0.0 314,546 195 1,613.1 1,222.0 1,002 11,162 1,526.1 pthread_cond_signal
0.0 263,565 15 17,571.0 4,919.0 1,312 171,325 43,090.0 open
0.0 161,136 1 161,136.0 161,136.0 161,136 161,136 0.0 pthread_cond_wait
0.0 74,766 7 10,680.9 11,422.0 2,354 15,871 4,097.6 msgsnd
0.0 56,841 11 5,167.4 5,240.0 1,002 14,378 4,056.6 pthread_cond_broadcast
0.0 13,006 3 4,335.3 5,561.0 1,583 5,862 2,388.3 pipe2
0.0 8,367 7 1,195.3 1,172.0 1,072 1,373 113.6 fcntl
0.0 4,679 2 2,339.5 2,339.5 1,683 2,996 928.4 sigaction
0.0 3,968 2 1,984.0 1,984.0 1,613 2,355 524.7 fwrite
0.0 2,385 2 1,192.5 1,192.5 1,113 1,272 112.4 fflush
0.0 2,134 1 2,134.0 2,134.0 2,134 2,134 0.0 openat64
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ----------- ----------- ------------ ---------------------------------
27.3 3,642,255,174 292 12,473,476.6 6,686,692.0 10,150 77,013,961 16,971,943.3 cudaMemcpyAsync
27.3 3,641,878,090 292 12,472,185.2 6,686,030.5 9,779 77,013,541 16,971,857.3 cudaMemcpyAsync
12.3 1,644,181,348 4 411,045,337.0 445,189,418.5 292,173,892 461,628,619 80,323,915.4 cudaDeviceSynchronize
12.3 1,644,175,465 4 411,043,866.3 445,187,835.0 292,171,828 461,627,967 80,324,401.9 cudaDeviceSynchronize
5.6 743,243,840 7,329 101,411.4 10,411.0 5,751 44,795,302 1,077,893.2 cudaLaunchKernel
5.5 740,041,084 7,329 100,974.4 10,049.0 5,531 44,793,449 1,077,873.4 cudaLaunchKernel
4.7 633,610,941 929 682,035.5 551,860.0 310,177 15,563,932 652,309.7 cudaMalloc
4.7 632,723,037 929 681,079.7 551,389.0 309,776 15,563,390 652,166.2 cudaMalloc
0.1 13,703,250 4 3,425,812.5 4,239,210.0 799,906 4,424,924 1,752,800.4 cuLibraryLoadData
0.1 7,095,752 1 7,095,752.0 7,095,752.0 7,095,752 7,095,752 0.0 cudaFree
0.1 7,093,678 1 7,093,678.0 7,093,678.0 7,093,678 7,093,678 0.0 cudaFree
0.0 4,429,468 292 15,169.4 16,457.5 2,194 62,843 7,951.2 cudaStreamSynchronize
0.0 3,988,373 292 13,658.8 14,809.0 1,974 60,820 7,014.9 cudaStreamSynchronize
0.0 2,094,160 928 2,256.6 1,483.0 481 22,935 1,797.5 cudaStreamIsCapturing_v10000
0.0 962,354 2 481,177.0 481,177.0 458,968 503,386 31,408.3 cudaGetDeviceProperties_v2_v12000
0.0 878,711 18 48,817.3 441.0 421 861,887 202,920.0 cudaEventCreateWithFlags
0.0 215,729 784 275.2 171.0 101 5,350 254.1 cuGetProcAddress_v2
0.0 8,106 3 2,702.0 1,503.0 1,122 5,481 2,414.2 cuInit
0.0 7,997 2 3,998.5 3,998.5 1,674 6,323 3,287.3 cudaGetDriverEntryPoint_v11030
0.0 4,440 4 1,110.0 1,158.0 641 1,483 348.8 cuLibraryGetKernel
0.0 1,563 3 521.0 281.0 280 1,002 416.6 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- ----------------------------------------------------------------------------------------------------
88.3 3,008,940,150 900 3,343,266.8 1,452,872.0 1,437,240 7,050,554 2,162,776.2 ampere_sgemm_128x64_tn
1.9 63,096,270 128 492,939.6 485,293.5 481,309 594,524 27,582.1 ampere_sgemm_128x128_nn
1.6 55,296,126 1,544 35,813.6 37,727.0 29,088 46,592 4,304.1 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
1.2 42,029,374 128 328,354.5 323,230.5 320,959 398,174 18,728.8 ampere_sgemm_128x128_tn
1.1 36,128,955 256 141,128.7 138,704.0 133,376 149,663 6,500.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
0.8 26,187,541 128 204,590.2 202,383.0 201,343 234,207 8,048.8 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.7 24,087,967 128 188,187.2 186,095.0 184,735 216,511 7,691.9 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.7 23,876,358 768 31,089.0 32,704.0 24,000 40,064 4,148.9 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.6 22,130,406 128 172,893.8 171,551.0 170,911 191,967 4,978.5 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.6 19,779,769 128 154,529.4 154,575.5 145,855 157,855 1,514.1 void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
0.6 19,120,797 128 149,381.2 149,215.0 148,191 151,583 734.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<float, float, floa…
0.5 15,628,649 768 20,349.8 14,400.0 13,824 33,856 8,413.1 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, std::arr…
0.4 14,559,957 128 113,749.7 112,543.5 111,647 130,528 4,432.9 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MaxOps<flo…
0.3 11,429,057 128 89,289.5 88,703.0 86,815 99,455 2,424.3 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
0.3 11,138,154 128 87,016.8 86,847.5 86,399 89,664 612.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_kernel_cuda(at::TensorIt…
0.2 5,567,999 260 21,415.4 21,312.0 20,032 25,600 574.4 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
0.1 4,828,940 260 18,572.8 18,432.0 17,376 21,632 809.7 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
0.1 2,217,561 260 8,529.1 7,776.0 7,008 58,336 5,303.0 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 608,414 128 4,753.2 4,704.0 4,480 5,984 280.7 void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 571,165 260 2,196.8 2,176.0 1,984 3,040 170.3 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctorOnSelf_add<float>, st…
0.0 522,784 260 2,010.7 1,984.0 1,920 2,464 119.6 void at::native::vectorized_elementwise_kernel<(int)4, at::native::rsqrt_kernel_cuda(at::TensorIter…
0.0 520,443 256 2,033.0 2,016.0 1,760 2,720 169.9 void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
0.0 249,437 128 1,948.7 1,920.0 1,760 2,688 148.5 void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, std::array<c…
0.0 3,392 1 3,392.0 3,392.0 3,392 3,392 0.0 void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<unsigned int, (int)4, void …
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- ------------ ----------- -------- ---------- ------------ ----------------------------
100.0 3,548,951,388 292 12,153,943.1 6,252,206.0 1,888 76,263,521 16,791,999.3 [CUDA memcpy Host-to-Device]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ----------------------------
13,627.402 292 46.669 26.214 0.010 104.858 42.616 [CUDA memcpy Host-to-Device]
Generated:
/nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_forward.nsys-rep
/nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_forward.sqlite
Saved: nsys_results/2.7B_ctx512_forward.nsys-rep
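The `Model configuration` values and the `cuda_gpu_mem_size_sum` report of the forward run above can be cross-checked with a little arithmetic. The sketch below estimates the size of the weights from the configuration and compares it with the 13,627.402 MB of host-to-device copies. The parameter layout is an assumption, not something the log states: four `d_model × d_model` attention matrices, three `d_model × d_ff` feed-forward matrices, and two norm gain vectors per layer, plus an untied token embedding, an `lm_head`, and one final norm. Weights are taken to be fp32, which agrees with the `sgemm` (single-precision) kernels in the log.

```python
# Values copied from the "Model configuration" lines of the log.
d_model = 2560
d_ff = 10240
num_layers = 32
vocab_size = 10000

BYTES_PER_FP32 = 4  # assumption: weights are stored in fp32

# Assumed per-layer layout (not stated in the log).
attn_params = 4 * d_model * d_model   # Q, K, V and output projections
ffn_params = 3 * d_model * d_ff       # three-matrix gated feed-forward
norm_params = 2 * d_model             # two norm gain vectors
per_layer_params = attn_params + ffn_params + norm_params

# Assumed untied embedding and lm_head, plus one final norm.
total_params = (
    num_layers * per_layer_params
    + 2 * vocab_size * d_model
    + d_model
)

# nsys reports MB as 10**6 bytes.
total_mb = total_params * BYTES_PER_FP32 / 1e6
square_matrix_mb = d_model * d_model * BYTES_PER_FP32 / 1e6
ffn_matrix_mb = d_model * d_ff * BYTES_PER_FP32 / 1e6
norm_vector_mb = d_model * BYTES_PER_FP32 / 1e6

# Number of separate weight tensors under the assumed layout.
tensor_count = num_layers * 9 + 3

reported_total_mb = 13627.402  # Total (MB) from cuda_gpu_mem_size_sum

print(f"estimated weights: {total_mb:.3f} MB in {tensor_count} tensors")
print(f"reported H2D copy: {reported_total_mb:.3f} MB in 292 copies")
print(f"difference:        {reported_total_mb - total_mb:.3f} MB")
print(f"per-tensor sizes:  {square_matrix_mb:.3f}, {ffn_matrix_mb:.3f}, {norm_vector_mb:.3f} MB")
```

Under these assumptions the estimate lands within about 0.2 MB of the reported total, and the three per-tensor sizes line up with the reported median (26.214 MB), maximum (104.858 MB), and minimum (0.010 MB). This suggests the host-to-device traffic is dominated by moving the model weights to the GPU. The assumed layout gives 291 tensors against 292 reported copies; the log does not say what the remaining copy is.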
Profiling: size=2.7B, context_length=512, type=backward
Collecting data...
Using device: cuda
Model configuration:
Size: 2.7B
d_model: 2560
d_ff: 10240
num_layers: 32
num_heads: 32
vocab_size: 10000
context_length: 512
batch_size: 4
Initializing model...
Creating random batch...
Running profiled step 1/1...
Profiling complete.
Generating '/tmp/nsys-report-643e.qdstrm'
[1/8] [========================100%] 2.7B_ctx512_backward.nsys-rep
Processing 319426 events:
[2/8] [========================100%] 2.7B_ctx512_backward.sqlite
[3/8] Executing 'nvtx_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- --------------- --------------- ------------- ------------- ------------- ------- ---------------------------------------
40.0 9,038,701,145 1 9,038,701,145.0 9,038,701,145.0 9,038,701,145 9,038,701,145 0.0 PushPop :warmup
11.4 2,580,601,956 1 2,580,601,956.0 2,580,601,956.0 2,580,601,956 2,580,601,956 0.0 PushPop :profiled_step
9.3 2,096,925,001 128 16,382,226.6 6,127,347.0 4,487,526 956,376,632 84,784,737.8 PushPop :attention_sublayer
7.7 1,734,457,535 1 1,734,457,535.0 1,734,457,535.0 1,734,457,535 1,734,457,535 0.0 PushPop :backward_pass
4.4 997,869,240 4 249,467,310.0 7,439,170.0 7,001,278 975,989,622 484,348,331.1 PushPop :layer_0
3.7 842,812,940 1 842,812,940.0 842,812,940.0 842,812,940 842,812,940 0.0 PushPop :forward_pass
3.3 739,320,395 128 5,775,940.6 1,336,301.5 1,194,849 62,729,175 8,455,523.5 PushPop :ffn_sublayer
2.8 627,142,985 128 4,899,554.6 1,822,227.5 1,329,303 144,953,360 15,513,585.1 PushPop :scaled_dot_product_attention
2.8 626,524,630 128 4,894,723.7 1,815,875.5 1,325,455 144,932,249 15,512,243.9 PushPop :annotated_scaled_dot_product_attention
1.6 362,249,576 128 2,830,074.8 1,741,765.5 1,635,813 100,411,102 8,720,479.9 PushPop :rope encoding
1.5 333,126,417 128 2,602,550.1 376,777.5 332,752 246,677,283 21,771,285.2 PushPop :attention projections
1.2 274,709,245 128 2,146,166.0 860,099.0 382,258 75,246,383 7,651,076.2 PushPop :computing softmax
1.2 271,745,113 4 67,936,278.3 26,154,763.0 21,527,378 197,908,209 86,675,407.8 PushPop :layer_29
1.0 227,741,444 128 1,779,230.0 370,736.0 307,953 46,345,999 6,204,920.8 PushPop :computing attention scores
0.6 142,421,504 4 35,605,376.0 26,077,867.5 25,932,182 64,333,587 19,152,286.3 PushPop :layer_26
0.6 136,041,482 4 34,010,370.5 28,645,670.0 26,109,819 52,640,323 12,643,561.3 PushPop :layer_28
0.6 126,512,656 4 31,628,164.0 26,205,958.0 26,082,647 48,018,093 10,927,151.8 PushPop :layer_30
0.5 114,962,665 128 898,145.8 203,293.0 184,571 46,472,848 5,076,259.9 PushPop :causal mask construction
0.4 95,830,304 4 23,957,576.0 26,105,897.0 17,486,156 26,132,354 4,314,313.3 PushPop :layer_27
0.4 88,848,393 4 22,212,098.3 25,692,813.0 11,293,521 26,169,246 7,288,336.9 PushPop :layer_31
0.4 88,736,626 4 22,184,156.5 24,073,262.5 14,438,429 26,151,672 5,503,309.8 PushPop :layer_25
0.4 79,379,235 4 19,844,808.8 22,947,263.0 7,351,133 26,133,576 8,849,435.0 PushPop :layer_24
0.3 76,297,051 4 19,074,262.8 7,165,048.5 5,875,173 56,091,781 24,704,712.8 PushPop :layer_13
0.3 59,512,107 4 14,878,026.8 14,860,931.0 7,436,058 22,354,187 6,468,240.0 PushPop :layer_23
0.2 55,741,434 4 13,935,358.5 448,884.0 249,548 54,594,118 27,106,006.7 PushPop :token_embeddings
0.2 50,340,479 128 393,285.0 368,376.5 329,164 2,626,663 203,675.4 PushPop :final matmul
0.2 36,142,801 4 9,035,700.3 8,277,425.5 5,727,053 13,860,897 3,973,407.0 PushPop :layer_21
0.2 35,761,577 4 8,940,394.3 6,869,224.0 5,853,060 16,170,069 4,910,824.5 PushPop :layer_2
0.2 34,757,470 4 8,689,367.5 7,636,290.5 5,780,838 13,704,051 3,736,557.0 PushPop :layer_18
0.2 34,523,872 4 8,630,968.0 8,070,350.0 5,800,627 12,582,545 3,372,567.1 PushPop :layer_12
0.2 34,080,434 4 8,520,108.5 6,754,163.5 5,854,342 14,717,765 4,205,817.4 PushPop :layer_19
0.1 33,513,083 4 8,378,270.8 6,998,136.0 5,849,813 13,666,998 3,656,560.1 PushPop :layer_6
0.1 32,801,401 4 8,200,350.3 6,888,301.5 5,932,104 13,092,694 3,297,104.7 PushPop :layer_8
0.1 32,565,199 4 8,141,299.8 7,240,857.0 5,838,080 12,245,405 2,816,447.1 PushPop :layer_7
0.1 31,837,704 4 7,959,426.0 6,523,703.0 5,848,630 12,941,668 3,381,872.5 PushPop :layer_14
0.1 31,498,641 4 7,874,660.3 6,475,794.0 5,838,922 12,708,131 3,269,957.7 PushPop :layer_3
0.1 31,348,507 4 7,837,126.8 6,649,840.0 5,891,345 12,157,482 2,950,555.2 PushPop :layer_9
0.1 31,231,769 4 7,807,942.3 6,621,113.5 5,791,900 12,197,642 3,017,970.3 PushPop :layer_20
0.1 31,131,060 4 7,782,765.0 6,795,294.0 5,920,692 11,619,780 2,616,144.6 PushPop :layer_17
0.1 31,050,682 4 7,762,670.5 6,469,591.5 5,834,423 12,277,076 3,065,818.9 PushPop :layer_4
0.1 30,847,456 4 7,711,864.0 6,501,799.5 5,850,986 11,992,871 2,905,318.3 PushPop :layer_5
0.1 30,766,828 4 7,691,707.0 6,546,382.0 5,847,429 11,826,635 2,826,192.3 PushPop :layer_11
0.1 30,737,009 4 7,684,252.3 6,848,839.0 5,787,211 11,252,120 2,431,495.3 PushPop :layer_16
0.1 30,503,764 4 7,625,941.0 6,613,569.0 5,754,397 11,522,229 2,716,611.7 PushPop :layer_22
0.1 30,480,818 4 7,620,204.5 6,521,132.5 5,763,384 11,675,169 2,760,575.6 PushPop :layer_15
0.1 30,471,028 4 7,617,757.0 6,584,306.5 5,849,422 11,452,993 2,624,857.7 PushPop :layer_10
0.1 29,872,717 4 7,468,179.3 5,907,241.0 5,856,306 12,201,929 3,155,980.2 PushPop :layer_1
0.1 20,584,393 128 160,815.6 156,551.5 144,513 249,629 16,515.2 PushPop :output projection
0.0 11,050,406 4 2,762,601.5 463,026.5 152,739 9,971,614 4,814,845.7 PushPop :lm_head
0.0 11,005,837 4 2,751,459.3 3,281,549.5 495,019 3,947,719 1,612,596.7 PushPop :final_norm
0.0 3,261,275 1 3,261,275.0 3,261,275.0 3,261,275 3,261,275 0.0 PushPop :loss_computation
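The single-instance ranges in the table above can be used for a quick sanity check on how the profiled step divides between its phases. The sketch below uses only totals copied from that table. It assumes the `forward_pass`, `loss_computation`, and `backward_pass` ranges are nested directly inside `profiled_step`, which the range names suggest but the log does not confirm.

```python
# Total times in nanoseconds, copied from the nvtx_sum table of the backward run.
profiled_step_ns = 2_580_601_956
forward_ns = 842_812_940
backward_ns = 1_734_457_535
loss_ns = 3_261_275

# How much longer the backward pass takes than the forward pass.
backward_to_forward = backward_ns / forward_ns

# Time inside profiled_step not covered by the three sub-ranges
# (assumes the sub-ranges are nested in profiled_step and do not overlap).
unaccounted_ns = profiled_step_ns - (forward_ns + loss_ns + backward_ns)

print(f"backward / forward: {backward_to_forward:.2f}")
print(f"forward share:      {forward_ns / profiled_step_ns:.1%}")
print(f"backward share:     {backward_ns / profiled_step_ns:.1%}")
print(f"unaccounted:        {unaccounted_ns / 1e3:.1f} us")
```

The backward pass comes out at roughly twice the forward pass, in line with the usual rule of thumb for transformer training, and the three ranges cover almost all of the profiled step. NVTX ranges measure host-side time between push and pop, so these figures reflect GPU work only to the extent that the script synchronizes inside each range.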
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------------- ------------- --------- -------------- --------------- ----------------------
47.0 21,952,963,197 230 95,447,666.1 100,118,655.0 3,066 225,544,678 22,928,907.3 poll
43.6 20,373,469,766 13 1,567,189,982.0 972,600,948.0 133,942 10,186,269,697 2,688,018,597.0 pthread_cond_wait
4.2 1,944,993,312 4,133 470,600.8 184,401.0 1,001 21,478,101 750,696.6 read
2.6 1,214,626,381 3,715 326,951.9 47,663.0 1,092 55,529,039 2,437,572.0 ioctl
2.3 1,058,507,220 2,197 481,796.6 464,700.0 2,815 11,177,123 348,200.1 open64
0.1 40,133,388 17,511 2,291.9 1,964.0 1,804 761,171 6,002.0 mmap64
0.0 20,772,645 4,650 4,467.2 2,655.0 1,001 5,968,306 87,551.7 munmap
0.0 16,188,599 730 22,176.2 3,957.5 1,753 9,800,718 362,491.9 socket
0.0 15,074,102 98 153,817.4 115,446.0 67,833 2,227,251 222,921.1 pthread_create
0.0 12,955,655 15 863,710.3 5,010.0 1,302 12,870,659 3,321,626.7 open
0.0 12,783,328 113 113,126.8 1,774.0 1,092 11,896,882 1,118,648.7 fopen
0.0 10,122,870 2 5,061,435.0 5,061,435.0 5,061,039 5,061,831 560.0 nanosleep
0.0 7,425,546 665 11,166.2 4,238.0 1,593 2,608,507 102,132.6 mmap
0.0 4,026,846 1,707 2,359.0 1,583.0 1,001 26,492 2,137.1 write
0.0 3,638,160 65 55,971.7 56,120.0 39,176 69,045 4,399.7 sleep
0.0 1,932,443 3 644,147.7 905,352.0 3,416 1,023,675 558,034.8 fread
0.0 1,246,452 8 155,806.5 43,565.5 4,800 591,558 235,234.9 fopen64
0.0 1,134,247 74 15,327.7 12,925.0 1,182 58,615 9,676.5 fgets
0.0 457,155 2 228,577.5 228,577.5 51,030 406,125 251,090.1 pthread_cond_timedwait
0.0 456,465 10 45,646.5 34,322.0 17,154 109,525 34,022.9 sem_timedwait
0.0 355,771 146 2,436.8 1,172.5 1,002 11,863 2,576.0 pthread_cond_signal
0.0 235,762 1 235,762.0 235,762.0 235,762 235,762 0.0 pthread_mutex_lock
0.0 214,280 4 53,570.0 7,244.0 1,123 198,669 96,791.1 pthread_cond_broadcast
0.0 76,551 7 10,935.9 8,096.0 4,058 21,783 6,338.3 msgsnd
0.0 9,418 3 3,139.3 2,625.0 1,443 5,350 2,003.6 pipe2
0.0 6,452 4 1,613.0 1,493.0 1,312 2,154 383.4 fflush
0.0 5,440 2 2,720.0 2,720.0 2,284 3,156 616.6 sigaction
0.0 5,150 4 1,287.5 1,247.5 1,052 1,603 229.7 fcntl
0.0 4,298 2 2,149.0 2,149.0 1,422 2,876 1,028.1 fwrite
0.0 2,324 1 2,324.0 2,324.0 2,324 2,324 0.0 openat64
0.0 1,052 1 1,052.0 1,052.0 1,052 1,052 0.0 listen
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- -------- ----------- ------------- ---------------------------------
21.4 7,247,233,645 560 12,941,488.7 1,325,660.5 9,599 119,387,515 24,681,326.6 cudaMemcpyAsync
21.4 7,246,813,399 560 12,940,738.2 1,325,399.5 9,077 119,386,754 24,681,286.0 cudaMemcpyAsync
17.4 5,898,486,517 19,695 299,491.6 9,428.0 5,310 495,334,972 3,778,261.6 cudaLaunchKernel
17.4 5,879,592,231 19,695 298,532.2 9,067.0 5,100 495,334,451 3,777,467.9 cudaLaunchKernel
5.8 1,954,680,414 5 390,936,082.8 494,370,574.0 45,830 494,620,774 218,750,348.8 cudaDeviceSynchronize
5.8 1,954,675,694 5 390,935,138.8 494,369,412.0 44,447 494,619,652 218,750,577.8 cudaDeviceSynchronize
3.9 1,329,778,478 720 1,846,914.6 466,047.5 117,400 46,091,250 5,374,718.0 cudaMalloc
3.9 1,329,237,922 720 1,846,163.8 465,656.5 117,119 46,090,329 5,374,621.9 cudaMalloc
2.6 871,466,771 1,544 564,421.5 8,917.0 7,485 5,688,998 1,621,205.1 cuLaunchKernel
0.2 64,510,122 260 248,115.9 9,649.0 6,622 5,681,183 708,276.6 cudaMemsetAsync
0.2 64,437,317 260 247,835.8 8,892.5 6,432 5,680,892 708,289.4 cudaMemsetAsync
0.0 12,883,425 4 3,220,856.3 3,849,021.0 736,913 4,448,470 1,682,105.8 cuLibraryLoadData
0.0 7,134,899 2 3,567,449.5 3,567,449.5 130,686 7,004,213 4,860,317.6 cudaFree
0.0 7,133,016 2 3,566,508.0 3,566,508.0 130,245 7,002,771 4,859,609.7 cudaFree
0.0 4,369,272 292 14,963.3 16,242.0 2,094 65,689 7,732.7 cudaStreamSynchronize
0.0 3,879,440 292 13,285.8 14,408.5 1,894 56,441 6,777.7 cudaStreamSynchronize
0.0 1,702,864 716 2,378.3 1,433.0 501 26,021 2,049.2 cudaStreamIsCapturing_v10000
0.0 1,516,692 772 1,964.6 772.0 611 885,995 31,860.1 cuKernelGetFunction
0.0 961,944 36 26,720.7 421.0 410 927,456 154,417.6 cudaEventCreateWithFlags
0.0 941,753 2 470,876.5 470,876.5 450,201 491,552 29,239.6 cudaGetDeviceProperties_v2_v12000
0.0 253,757 784 323.7 281.0 110 6,513 298.7 cuGetProcAddress_v2
0.0 12,023 3 4,007.7 2,705.0 1,633 7,685 3,229.5 cuInit
0.0 6,813 5 1,362.6 1,563.0 832 1,723 419.8 cuLibraryGetKernel
0.0 2,175 2 1,087.5 1,087.5 682 1,493 573.5 cudaGetDriverEntryPoint_v11030
0.0 1,712 3 570.7 170.0 130 1,412 728.9 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- ----------------------------------------------------------------------------------------------------
29.0 2,980,566,485 900 3,311,740.5 1,448,649.0 1,431,800 7,050,586 2,135,138.7 ampere_sgemm_128x64_tn
21.8 2,235,749,205 772 2,896,048.2 1,447,000.5 1,444,857 5,774,017 2,035,002.0 void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1>(T1::Params)
14.5 1,487,949,867 260 5,722,884.1 5,725,188.0 5,599,655 5,739,617 15,506.8 ampere_sgemm_128x32_sliced1x4_nn
8.3 853,988,820 768 1,111,964.6 1,423,960.5 479,774 1,438,456 442,264.6 ampere_sgemm_128x128_nn
7.1 731,375,952 128 5,713,874.6 5,712,915.5 5,708,065 5,761,569 6,157.3 ampere_sgemm_128x64_nn
7.1 727,644,706 128 5,684,724.3 5,683,878.5 5,682,278 5,743,521 6,273.2 ampere_sgemm_128x64_nt
1.9 197,378,632 2,064 95,629.2 41,008.0 1,919 234,687 74,285.4 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
1.7 179,205,434 931 192,487.0 171,327.0 86,752 372,990 110,611.7 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
1.2 122,742,918 256 479,464.5 479,357.0 478,877 484,510 537.1 ampere_sgemm_128x128_nt
1.0 103,020,714 2,634 39,111.9 30,400.0 2,176 237,246 52,548.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, std::arr…
1.0 102,337,256 512 199,877.5 204,191.0 181,311 233,662 9,006.4 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.9 93,117,311 1,769 52,638.4 34,016.0 24,096 359,678 66,545.4 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.9 91,351,725 3,088 29,582.8 30,624.0 11,712 46,592 8,070.4 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.8 83,057,605 256 324,443.8 323,326.0 319,870 397,982 9,076.8 ampere_sgemm_128x128_tn
0.5 47,711,927 256 186,374.7 185,919.0 184,575 216,543 3,737.4 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.4 42,196,216 644 65,522.1 88,191.5 25,343 97,759 32,415.8 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
0.4 39,022,817 256 152,432.9 152,415.0 148,127 159,423 3,297.3 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<float, float, floa…
0.4 38,268,484 256 149,486.3 149,967.0 145,279 154,495 2,994.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::neg_kernel_cuda(at::TensorIterat…
0.2 19,788,411 128 154,597.0 154,543.5 145,343 157,951 1,412.5 void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
0.2 18,542,500 128 144,863.3 144,911.0 142,463 145,951 483.2 void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_backward_kernel_cuda(at:…
0.1 14,484,187 128 113,157.7 112,543.5 111,903 129,887 2,961.6 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MaxOps<flo…
0.1 11,130,151 128 86,954.3 86,879.0 86,239 89,087 453.3 void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_kernel_cuda(at::TensorIt…
0.1 10,504,807 408 25,747.1 2,016.0 1,728 76,320 31,970.6 void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, std::array<c…
0.1 5,542,934 260 21,319.0 21,312.0 19,903 25,376 510.8 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
0.1 5,499,781 128 42,967.0 42,960.0 42,528 43,711 209.0 void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>,…
0.1 5,171,306 260 19,889.6 19,872.0 18,848 20,512 201.6 void at::native::reduce_kernel<(int)128, (int)4, at::native::ReduceOp<float, at::native::func_wrapp…
0.0 5,097,957 520 9,803.8 9,328.0 1,920 20,480 7,428.6 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, floa…
0.0 4,804,234 260 18,477.8 18,415.5 17,280 22,368 628.2 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
0.0 4,712,746 260 18,125.9 18,112.0 17,984 18,368 64.7 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 4,690,950 256 18,324.0 18,080.0 17,344 20,128 832.3 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 3,374,223 128 26,361.1 26,336.0 25,728 27,520 336.4 void at::native::<unnamed>::CatArrayBatchedCopy_aligned16_contig<at::native::<unnamed>::OpaqueType<…
0.0 2,255,125 260 8,673.6 7,840.0 7,168 58,816 5,454.5 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 1,481,783 128 11,576.4 11,584.0 10,784 12,800 362.5 void at::native::_scatter_gather_elementwise_kernel<(int)128, (int)4, void at::native::_cuda_scatte…
0.0 720,348 4 180,087.0 179,839.0 179,743 180,927 564.3 void at::native::<unnamed>::cunn_SoftMaxBackward<(int)4, float, float, float, at::native::<unnamed>…
0.0 677,307 260 2,605.0 2,592.0 2,272 3,040 125.2 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
0.0 608,670 128 4,755.2 4,720.0 4,544 5,824 190.5 void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 565,404 260 2,174.6 2,159.5 1,984 3,136 150.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctorOnSelf_add<float>, st…
0.0 536,861 260 2,064.8 2,048.0 1,792 2,912 187.4 void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
0.0 517,631 260 1,990.9 1,984.0 1,920 2,464 79.7 void at::native::vectorized_elementwise_kernel<(int)4, at::native::rsqrt_kernel_cuda(at::TensorIter…
0.0 440,733 4 110,183.3 110,447.5 108,351 111,487 1,475.3 void at::native::<unnamed>::cunn_SoftMaxForwardSmem<(int)4, float, float, float, at::native::<unnam…
0.0 275,103 4 68,775.8 68,815.5 68,160 69,312 622.1 void at::native::<unnamed>::nll_loss_forward_reduce_cuda_kernel_2d<float, float, long>(T1 *, T1 *, …
0.0 162,847 4 40,711.8 40,768.0 40,256 41,055 348.6 void <unnamed>::indexing_backward_kernel<float, (int)4>(const long *, const long *, const T1 *, T1 …
0.0 157,791 4 39,447.8 39,504.0 39,167 39,616 205.1 void at::native::<unnamed>::nll_loss_backward_reduce_cuda_kernel_2d<float, long>(T1 *, const T1 *, …
0.0 42,048 4 10,512.0 10,496.0 10,432 10,624 84.7 void at_cuda_detail::cub::DeviceRadixSortSingleTileKernel<at_cuda_detail::cub::DeviceRadixSortPolic…
0.0 16,064 4 4,016.0 4,016.0 3,712 4,320 251.3 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 12,736 4 3,184.0 3,232.0 2,944 3,328 183.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
0.0 10,368 4 2,592.0 2,592.0 2,592 2,592 0.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
0.0 9,600 4 2,400.0 2,384.0 2,336 2,496 69.1 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<long, long, long, …
0.0 3,488 1 3,488.0 3,488.0 3,488 3,488 0.0 void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<unsigned int, (int)4, void …
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- ------------ ------------ -------- ----------- ------------ ------------------------------
99.9 7,007,930,722 292 23,999,762.7 15,852,378.5 1,888 118,282,014 29,602,623.7 [CUDA memcpy Host-to-Device]
0.1 7,517,185 268 28,049.2 25,440.0 24,896 119,295 13,996.3 [CUDA memcpy Device-to-Device]
0.0 548,605 260 2,110.0 2,112.0 1,024 2,528 80.4 [CUDA memset]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------------------
13,627.402 292 46.669 26.214 0.010 104.858 42.616 [CUDA memcpy Host-to-Device]
6,189.235 268 23.094 20.972 20.972 102.400 12.254 [CUDA memcpy Device-to-Device]
0.021 260 0.000 0.000 0.000 0.000 0.000 [CUDA memset]
Generated:
/nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_backward.nsys-rep
/nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_backward.sqlite
Saved: nsys_results/2.7B_ctx512_backward.nsys-rep
Profiling: size=2.7B, context_length=512, type=train
Collecting data...
Using device: cuda
Model configuration:
Size: 2.7B
d_model: 2560
d_ff: 10240
num_layers: 32
num_heads: 32
vocab_size: 10000
context_length: 512
batch_size: 4
Initializing model...
Creating random batch...
Running profiled step 1/1...
Profiling complete.
Generating '/tmp/nsys-report-477e.qdstrm'
[1/8] [========================100%] 2.7B_ctx512_train.nsys-rep
Processing 339233 events:
[2/8] [========================100%] 2.7B_ctx512_train.sqlite
[3/8] Executing 'nvtx_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Style Range
-------- --------------- --------- --------------- --------------- ------------- ------------- ------------- ------- ---------------------------------------
43.2 9,866,325,771 1 9,866,325,771.0 9,866,325,771.0 9,866,325,771 9,866,325,771 0.0 PushPop :warmup
12.1 2,758,088,295 1 2,758,088,295.0 2,758,088,295.0 2,758,088,295 2,758,088,295 0.0 PushPop :profiled_step
8.3 1,896,613,679 128 14,817,294.4 5,119,042.5 4,546,131 859,285,061 75,684,546.2 PushPop :attention_sublayer
7.6 1,732,475,648 1 1,732,475,648.0 1,732,475,648.0 1,732,475,648 1,732,475,648 0.0 PushPop :backward_pass
4.0 903,231,175 4 225,807,793.8 7,524,818.5 7,188,525 880,993,013 436,790,217.6 PushPop :layer_0
3.7 843,662,433 1 843,662,433.0 843,662,433.0 843,662,433 843,662,433 0.0 PushPop :forward_pass
3.1 701,380,211 128 5,479,532.9 1,354,732.5 1,204,849 21,647,074 6,535,892.6 PushPop :ffn_sublayer
2.1 489,582,879 128 3,824,866.2 1,453,305.0 1,348,942 119,793,927 11,735,299.0 PushPop :scaled_dot_product_attention
2.1 488,840,469 128 3,819,066.2 1,448,836.5 1,345,043 119,781,121 11,734,047.3 PushPop :annotated_scaled_dot_product_attention
1.6 356,320,108 128 2,783,750.8 1,740,838.0 1,634,820 100,288,567 8,713,326.4 PushPop :rope encoding
1.4 329,760,873 128 2,576,256.8 371,437.5 339,595 183,266,201 16,763,493.9 PushPop :attention projections
0.9 195,493,826 128 1,527,295.5 379,047.5 340,877 44,298,402 5,474,557.1 PushPop :computing attention scores
0.8 178,760,308 1 178,760,308.0 178,760,308.0 178,760,308 178,760,308 0.0 PushPop :optimizer_step
0.7 163,841,492 128 1,280,011.7 424,596.5 386,888 50,208,544 4,438,987.6 PushPop :computing softmax
0.6 140,356,919 4 35,089,229.8 26,119,665.5 26,085,203 62,032,385 17,962,112.7 PushPop :layer_27
0.5 105,233,309 4 26,308,327.3 26,052,554.0 25,826,967 27,301,234 671,355.4 PushPop :layer_26
0.4 96,421,866 4 24,105,466.5 26,089,967.0 18,037,847 26,204,085 4,045,466.9 PushPop :layer_28
0.4 95,504,771 4 23,876,192.8 26,059,161.5 17,172,503 26,213,945 4,469,902.8 PushPop :layer_29
0.4 92,776,559 4 23,194,139.8 26,062,102.5 14,544,517 26,107,837 5,766,560.0 PushPop :layer_30
0.4 92,234,238 4 23,058,559.5 26,032,685.0 14,091,631 26,077,237 5,978,031.5 PushPop :layer_31
0.4 90,672,651 4 22,668,162.8 26,121,999.5 12,256,448 26,172,204 6,941,258.7 PushPop :layer_24
0.4 90,165,106 4 22,541,276.5 26,143,963.0 11,440,600 26,436,580 7,401,742.5 PushPop :layer_25
0.3 78,305,758 128 611,763.7 197,978.0 186,926 40,472,612 3,562,877.8 PushPop :causal mask construction
0.3 75,908,733 4 18,977,183.3 6,055,561.5 6,021,048 57,776,562 25,866,268.5 PushPop :layer_13
0.3 70,968,732 4 17,742,183.0 6,057,921.5 5,969,237 52,883,652 23,427,731.1 PushPop :layer_8
0.3 69,634,238 4 17,408,559.5 5,969,758.0 5,930,191 51,764,531 22,903,993.2 PushPop :layer_17
0.3 66,908,110 4 16,727,027.5 17,452,280.0 12,234,965 19,768,585 3,364,546.9 PushPop :layer_23
0.2 52,802,349 128 412,518.4 377,082.5 336,238 2,132,946 168,418.0 PushPop :final matmul
0.2 49,097,929 4 12,274,482.3 419,506.0 328,904 47,930,013 23,770,392.6 PushPop :token_embeddings
0.2 38,027,865 4 9,506,966.3 6,234,541.5 5,954,288 19,604,494 6,736,513.5 PushPop :layer_15
0.2 34,966,762 4 8,741,690.5 6,427,219.0 5,912,947 16,199,377 4,992,101.7 PushPop :layer_5
0.2 34,304,385 4 8,576,096.3 6,080,630.5 5,899,981 16,243,143 5,113,632.9 PushPop :layer_4
0.1 34,070,137 4 8,517,534.3 5,947,304.5 5,883,239 16,292,289 5,183,374.5 PushPop :layer_14
0.1 32,848,815 4 8,212,203.8 5,941,763.5 5,902,877 15,062,411 4,566,877.7 PushPop :layer_20
0.1 32,502,055 4 8,125,513.8 5,998,494.0 5,826,338 14,678,729 4,370,046.7 PushPop :layer_19
0.1 32,417,751 4 8,104,437.8 6,533,502.0 5,935,090 13,415,657 3,560,959.5 PushPop :layer_6
0.1 32,416,460 4 8,104,115.0 6,506,584.5 6,014,687 13,388,604 3,550,982.2 PushPop :layer_1
0.1 31,808,618 4 7,952,154.5 5,921,714.0 5,891,485 14,073,705 4,081,060.2 PushPop :layer_16
0.1 31,789,241 4 7,947,310.3 6,009,927.0 5,963,225 13,806,162 3,906,027.9 PushPop :layer_7
0.1 31,706,038 4 7,926,509.5 6,383,368.0 5,973,456 12,965,846 3,370,600.0 PushPop :layer_2
0.1 31,372,556 4 7,843,139.0 5,990,870.0 5,851,477 13,539,339 3,798,885.1 PushPop :layer_9
0.1 31,361,794 4 7,840,448.5 5,970,555.0 5,951,453 13,469,231 3,752,534.3 PushPop :layer_18
0.1 30,846,866 4 7,711,716.5 5,997,508.0 5,889,451 12,962,399 3,501,390.9 PushPop :layer_11
0.1 30,595,954 4 7,648,988.5 6,079,518.0 5,939,329 12,497,589 3,234,363.9 PushPop :layer_21
0.1 30,432,163 4 7,608,040.8 6,164,018.5 5,884,161 12,219,965 3,079,599.0 PushPop :layer_12
0.1 30,313,833 4 7,578,458.3 6,130,974.5 5,885,794 12,166,090 3,065,597.2 PushPop :layer_3
0.1 29,851,817 4 7,462,954.3 6,027,401.0 5,859,061 11,937,954 2,984,435.2 PushPop :layer_22
0.1 29,439,851 4 7,359,962.8 5,987,648.5 5,851,296 11,613,258 2,836,778.7 PushPop :layer_10
0.1 21,192,761 128 165,568.4 158,084.5 146,446 260,840 20,724.3 PushPop :output projection
0.1 12,491,567 4 3,122,891.8 3,934,268.0 608,000 4,015,031 1,677,064.7 PushPop :final_norm
0.0 3,113,887 1 3,113,887.0 3,113,887.0 3,113,887 3,113,887 0.0 PushPop :loss_computation
0.0 1,522,931 4 380,732.8 167,282.5 158,440 1,029,926 432,842.9 PushPop :lm_head
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------------- --------------- --------- -------------- --------------- ----------------------
51.8 22,486,241,609 12 1,873,853,467.4 1,174,244,972.5 159,071 11,242,482,846 3,031,915,471.2 pthread_cond_wait
38.8 16,855,550,195 179 94,165,084.9 100,119,887.0 2,615 232,676,522 26,328,365.5 poll
4.4 1,925,983,440 4,468 431,061.6 122,855.5 1,002 7,254,614 526,839.0 read
2.5 1,095,783,652 2,197 498,763.6 490,690.0 2,986 11,305,435 350,983.0 open64
2.0 878,762,723 4,392 200,082.6 50,429.0 1,002 40,327,698 1,790,688.9 ioctl
0.1 41,107,956 17,513 2,347.3 1,994.0 1,814 596,387 5,073.7 mmap64
0.0 16,781,731 98 171,242.2 117,013.5 82,622 3,196,468 336,337.1 pthread_create
0.0 14,936,411 5,746 2,599.4 1,423.0 1,001 56,450 3,102.5 munmap
0.0 13,846,094 900 15,384.5 3,948.0 2,135 6,482,663 215,944.1 socket
0.0 12,498,781 113 110,608.7 1,973.0 1,082 11,304,413 1,062,900.9 fopen
0.0 12,419,584 15 827,972.3 6,262.0 1,172 12,330,732 3,182,146.8 open
0.0 10,076,940 2 5,038,470.0 5,038,470.0 5,017,935 5,059,005 29,040.9 nanosleep
0.0 8,122,423 65 124,960.4 56,450.0 39,517 4,049,118 495,490.0 sleep
0.0 5,727,743 2,384 2,402.6 1,613.0 1,001 40,921 2,411.9 write
0.0 5,676,611 841 6,749.8 4,439.0 1,262 49,998 5,138.8 mmap
0.0 1,613,690 3 537,896.7 701,854.0 3,717 908,119 473,969.7 fread
0.0 1,170,573 8 146,321.6 41,331.0 5,010 615,385 223,847.6 fopen64
0.0 1,136,181 74 15,353.8 12,820.0 1,172 58,634 9,813.4 fgets
0.0 516,992 11 46,999.3 30,880.0 14,348 235,521 63,615.9 sem_timedwait
0.0 453,798 203 2,235.5 1,283.0 1,002 14,218 2,293.0 pthread_cond_signal
0.0 116,238 2 58,119.0 58,119.0 24,198 92,040 47,971.5 pthread_cond_timedwait
0.0 70,900 7 10,128.6 10,711.0 5,701 13,447 3,154.7 msgsnd
0.0 21,923 1 21,923.0 21,923.0 21,923 21,923 0.0 pthread_mutex_lock
0.0 14,539 3 4,846.3 5,461.0 1,593 7,485 2,993.7 pipe2
0.0 11,442 7 1,634.6 1,733.0 1,122 2,125 325.5 fcntl
0.0 5,701 2 2,850.5 2,850.5 2,234 3,467 871.9 sigaction
0.0 5,551 2 2,775.5 2,775.5 1,533 4,018 1,757.2 fwrite
0.0 4,979 3 1,659.7 1,873.0 1,142 1,964 450.6 fflush
0.0 4,548 2 2,274.0 2,274.0 1,092 3,456 1,671.6 pthread_cond_broadcast
0.0 2,365 1 2,365.0 2,365.0 2,365 2,365 0.0 openat64
0.0 1,182 1 1,182.0 1,182.0 1,182 1,182 0.0 pthread_mutex_trylock
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ------------- ------------- ---------- ----------- ------------- ---------------------------------
32.7 7,699,062,535 25,295 304,370.9 9,278.0 4,599 495,402,097 3,601,347.1 cudaLaunchKernel
32.6 7,688,218,088 25,295 303,942.2 8,957.0 4,408 495,401,787 3,601,310.4 cudaLaunchKernel
6.8 1,590,651,814 560 2,840,449.7 228,522.5 7,394 24,446,695 5,266,448.2 cudaMemcpyAsync
6.8 1,590,368,607 560 2,839,943.9 228,267.0 7,084 24,446,044 5,266,294.5 cudaMemcpyAsync
6.0 1,404,040,027 6 234,006,671.2 125,623,731.5 59,392,445 496,968,134 195,461,349.5 cudaDeviceSynchronize
6.0 1,404,036,261 6 234,006,043.5 125,623,425.5 59,391,975 496,966,310 195,460,975.3 cudaDeviceSynchronize
4.1 962,781,997 890 1,081,777.5 535,348.0 283,475 45,822,988 3,747,493.7 cudaMalloc
4.1 961,965,203 890 1,080,859.8 534,226.0 282,143 45,822,347 3,746,358.4 cudaMalloc
0.3 75,979,762 1,544 49,209.7 8,812.5 7,415 5,644,171 320,920.8 cuLaunchKernel
0.3 71,308,169 260 274,262.2 110,336.0 6,302 5,737,494 686,387.2 cudaMemsetAsync
0.3 71,241,386 260 274,005.3 110,105.0 6,012 5,737,223 686,386.7 cudaMemsetAsync
0.0 11,032,542 4 2,758,135.5 3,403,764.5 630,926 3,594,087 1,421,895.7 cuLibraryLoadData
0.0 6,668,908 2 3,334,454.0 3,334,454.0 369,815 6,299,093 4,192,632.7 cudaFree
0.0 6,665,871 2 3,332,935.5 3,332,935.5 367,800 6,298,071 4,193,334.8 cudaFree
0.0 3,951,073 292 13,531.1 14,604.0 2,084 33,546 5,919.0 cudaStreamSynchronize
0.0 3,588,531 292 12,289.5 13,036.0 1,804 31,662 5,429.9 cudaStreamSynchronize
0.0 1,935,776 890 2,175.0 1,433.0 481 30,890 1,908.3 cudaStreamIsCapturing_v10000
0.0 1,530,948 772 1,983.1 752.0 601 850,335 30,583.2 cuKernelGetFunction
0.0 941,282 2 470,641.0 470,641.0 462,766 478,516 11,136.9 cudaGetDeviceProperties_v2_v12000
0.0 867,526 36 24,097.9 446.0 411 812,350 135,184.7 cudaEventCreateWithFlags
0.0 199,995 784 255.1 190.0 110 1,503 154.6 cuGetProcAddress_v2
0.0 6,353 3 2,117.7 1,763.0 1,243 3,347 1,095.9 cuInit
0.0 5,982 5 1,196.4 992.0 792 2,204 574.9 cuLibraryGetKernel
0.0 1,833 2 916.5 916.5 571 1,262 488.6 cudaGetDriverEntryPoint_v11030
0.0 1,392 3 464.0 190.0 150 1,052 509.6 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- --------- ----------- ----------------------------------------------------------------------------------------------------
27.3 3,002,482,417 900 3,336,091.6 1,453,497.5 1,432,568 7,054,938 2,156,094.1 ampere_sgemm_128x64_tn
20.3 2,235,157,813 772 2,895,282.1 1,446,522.0 1,444,856 5,804,966 2,034,343.4 void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1>(T1::Params)
13.5 1,487,779,391 260 5,722,228.4 5,723,684.5 5,599,298 5,744,839 15,668.3 ampere_sgemm_128x32_sliced1x4_nn
7.8 854,356,834 768 1,112,443.8 1,423,609.0 480,350 1,438,010 441,460.3 ampere_sgemm_128x128_nn
6.6 731,264,017 128 5,713,000.1 5,711,075.5 5,708,390 5,767,143 8,255.9 ampere_sgemm_128x64_nn
6.6 727,675,106 128 5,684,961.8 5,683,556.0 5,681,734 5,742,566 7,505.3 ampere_sgemm_128x64_nt
2.0 219,552,934 2,444 89,833.4 36,608.0 24,128 338,974 103,994.8 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
1.8 197,551,348 2,064 95,712.9 41,424.0 1,888 235,102 74,258.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
1.3 143,804,577 1,304 110,279.6 110,304.0 71,264 121,024 3,919.4 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
1.2 130,989,437 652 200,904.0 200,959.0 131,615 213,407 5,854.9 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListScalarListMe…
1.1 122,711,755 256 479,342.8 479,230.0 478,877 481,950 445.4 ampere_sgemm_128x128_nt
0.9 102,490,391 512 200,176.5 204,207.0 181,567 233,983 9,393.3 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.9 102,152,443 2,436 41,934.5 30,560.0 11,488 237,759 53,409.9 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, std::arr…
0.9 102,036,288 652 156,497.4 156,575.0 102,623 165,471 4,821.4 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
0.9 101,626,725 652 155,869.2 155,999.5 101,600 164,735 4,865.5 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
0.8 91,768,948 3,088 29,717.9 30,655.0 11,648 47,808 8,207.7 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.8 83,350,734 256 325,588.8 323,327.0 320,446 398,014 12,688.0 ampere_sgemm_128x128_tn
0.7 73,081,191 652 112,087.7 112,128.0 72,640 121,696 3,779.5 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListScalarListMe…
0.7 72,100,619 652 110,583.8 110,559.5 70,656 121,343 3,893.2 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
0.7 71,606,898 652 109,826.5 110,112.0 71,488 117,824 3,681.8 void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
0.4 47,817,595 256 186,787.5 185,823.0 184,735 216,255 5,259.4 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.4 44,066,285 256 172,133.9 171,455.5 169,471 191,391 3,494.3 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.4 42,214,467 644 65,550.4 88,591.0 25,024 99,487 32,658.3 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
0.4 39,057,413 256 152,568.0 152,431.0 147,999 158,719 3,254.9 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<float, float, floa…
0.3 38,212,459 256 149,267.4 149,663.5 145,407 154,303 2,952.5 void at::native::vectorized_elementwise_kernel<(int)4, at::native::neg_kernel_cuda(at::TensorIterat…
0.2 25,162,498 990 25,416.7 14,880.0 1,664 76,160 26,167.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, std::array<c…
0.2 19,800,351 128 154,690.2 154,735.5 145,503 157,567 1,301.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
0.2 18,556,130 128 144,969.8 144,991.5 140,799 146,047 562.8 void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_backward_kernel_cuda(at:…
0.1 14,518,616 128 113,426.7 112,288.0 111,392 130,015 4,182.0 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MaxOps<flo…
0.1 11,134,852 128 86,991.0 86,848.0 86,335 89,631 625.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_kernel_cuda(at::TensorIt…
0.1 5,566,435 260 21,409.4 21,312.0 19,392 26,208 565.0 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
0.1 5,510,950 128 43,054.3 43,072.0 42,592 43,616 214.8 void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>,…
0.0 5,230,759 260 20,118.3 20,128.0 18,976 20,799 242.2 void at::native::reduce_kernel<(int)128, (int)4, at::native::ReduceOp<float, at::native::func_wrapp…
0.0 5,217,579 520 10,033.8 9,376.0 1,888 20,608 7,635.3 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, floa…
0.0 4,816,299 260 18,524.2 18,431.5 17,311 22,560 794.4 void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
0.0 4,718,019 260 18,146.2 18,144.0 17,984 18,912 85.1 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 4,711,592 256 18,404.7 18,000.0 17,312 20,224 919.3 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 3,382,639 128 26,426.9 26,399.0 25,760 28,096 363.9 void at::native::<unnamed>::CatArrayBatchedCopy_aligned16_contig<at::native::<unnamed>::OpaqueType<…
0.0 2,235,989 260 8,600.0 7,776.0 7,136 58,208 5,342.0 void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
0.0 1,470,523 128 11,488.5 11,424.0 10,624 12,448 334.7 void at::native::_scatter_gather_elementwise_kernel<(int)128, (int)4, void at::native::_cuda_scatte…
0.0 718,396 4 179,599.0 179,471.0 178,847 180,607 733.7 void at::native::<unnamed>::cunn_SoftMaxBackward<(int)4, float, float, float, at::native::<unnamed>…
0.0 676,317 260 2,601.2 2,560.0 2,336 3,072 135.2 void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
0.0 601,178 128 4,696.7 4,640.0 4,448 5,760 278.6 void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 567,356 260 2,182.1 2,144.0 2,016 2,816 152.1 void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctorOnSelf_add<float>, st…
0.0 532,636 260 2,048.6 2,048.0 1,760 3,008 176.7 void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
0.0 521,918 260 2,007.4 1,984.0 1,888 2,560 111.7 void at::native::vectorized_elementwise_kernel<(int)4, at::native::rsqrt_kernel_cuda(at::TensorIter…
0.0 437,374 4 109,343.5 109,343.5 109,023 109,664 273.3 void at::native::<unnamed>::cunn_SoftMaxForwardSmem<(int)4, float, float, float, at::native::<unnam…
0.0 275,134 4 68,783.5 69,039.0 67,680 69,376 758.2 void at::native::<unnamed>::nll_loss_forward_reduce_cuda_kernel_2d<float, float, long>(T1 *, T1 *, …
0.0 165,983 4 41,495.8 40,992.0 40,480 43,519 1,425.2 void <unnamed>::indexing_backward_kernel<float, (int)4>(const long *, const long *, const T1 *, T1 …
0.0 156,287 4 39,071.8 39,072.0 37,951 40,192 920.8 void at::native::<unnamed>::nll_loss_backward_reduce_cuda_kernel_2d<float, long>(T1 *, const T1 *, …
0.0 42,240 4 10,560.0 10,576.0 10,432 10,656 94.2 void at_cuda_detail::cub::DeviceRadixSortSingleTileKernel<at_cuda_detail::cub::DeviceRadixSortPolic…
0.0 15,776 4 3,944.0 3,936.0 3,744 4,160 177.9 void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
0.0 12,704 4 3,176.0 3,232.0 3,008 3,232 112.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
0.0 10,400 4 2,600.0 2,592.0 2,592 2,624 16.0 void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
0.0 9,408 4 2,352.0 2,352.0 2,336 2,368 18.5 void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<long, long, long, …
0.0 3,552 1 3,552.0 3,552.0 3,552 3,552 0.0 void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<unsigned int, (int)4, void …
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- ----------- ----------- -------- ---------- ----------- ------------------------------
99.5 1,509,378,340 292 5,169,103.9 1,929,510.0 1,888 24,205,822 6,308,501.9 [CUDA memcpy Host-to-Device]
0.5 7,530,656 268 28,099.5 25,503.0 24,896 120,959 14,041.7 [CUDA memcpy Device-to-Device]
0.0 555,642 260 2,137.1 2,144.0 1,056 2,688 87.4 [CUDA memset]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------------------
13,627.402 292 46.669 26.214 0.010 104.858 42.616 [CUDA memcpy Host-to-Device]
6,189.235 268 23.094 20.972 20.972 102.400 12.254 [CUDA memcpy Device-to-Device]
0.021 260 0.000 0.000 0.000 0.000 0.000 [CUDA memset]
Generated:
/nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_train.nsys-rep
/nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_train.sqlite
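Saved: nsys_results/2.7B_ctx512_train.nsys-rep
The output above is a detailed NVIDIA Nsight Systems (nsys) statistics report. It profiles a GPT-style model with 2.7 billion parameters and a context length of 512 in three different scenarios.
First, what each test measures: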
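1. What each test measures
The three tests correspond to the core stages of training a deep learning model:
- Forward: the model computes predictions from its input. Data flows from the input layer to the output layer; no gradients are computed and no weights are updated. This test measures the compute cost and latency of inference.
- Backward: the gradient computation. Once the forward pass has produced predictions and a loss, backpropagation starts at the output layer and passes the error gradient back layer by layer, yielding each parameter's contribution to the loss (its gradient). This test measures the cost of computing gradients.
- Train (full step): the combination of forward pass, backward pass, and optimizer step. It represents the cost of one complete training iteration and is the most direct measure of training throughput.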
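2. Summary of results
The report shows several typical and important performance characteristics of large-model training and inference. They are summarized below in terms of overall time, compute bottlenecks, and memory operations.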
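2.1 Overall time and relative cost
The table compares the three tests, using the `profiled_step` NVTX range and its main phases:
| Test | Full step (`profiled_step`) | Forward (`forward_pass`) | Backward (`backward_pass`) | Optimizer (`optimizer_step`) |
|---|---|---|---|---|
| Forward | ~845 ms | ~845 ms | N/A | N/A |
| Backward | ~2,581 ms | ~843 ms | ~1,734 ms | N/A |
| Train | ~2,758 ms | ~844 ms | ~1,732 ms | ~179 ms |
Key insights:
- The backward pass costs about twice as much as the forward pass: in both the Backward and Train tests, `backward_pass` (~1,732 to 1,734 ms) takes about 2.05 times as long as `forward_pass` (~844 ms). This matches a common rule of thumb that backpropagation needs roughly 2-3 times the compute and memory traffic of the forward pass, because each matrix multiplication in the forward pass requires two in the backward pass (one for the input gradient, one for the weight gradient).
- The optimizer step is comparatively cheap: `optimizer_step` accounts for only about 6.5% of the training step (179 ms / 2,758 ms). In this setup the cost lies mainly in the forward and backward computation, not in the parameter update.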
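The ratios quoted above can be re-derived directly from the NVTX totals. The following is a minimal sketch that uses the four Train-run values copied from the `nvtx_sum` table; the helper name `phase_shares` is made up for this illustration.

```python
# NVTX totals in nanoseconds, copied from the Train run's nvtx_sum table.
NVTX_NS = {
    "profiled_step": 2_758_088_295,
    "forward_pass": 843_662_433,
    "backward_pass": 1_732_475_648,
    "optimizer_step": 178_760_308,
}


def phase_shares(ranges, whole="profiled_step"):
    """Return each phase's fraction of the enclosing range (hypothetical helper)."""
    total = ranges[whole]
    return {name: ns / total for name, ns in ranges.items() if name != whole}


shares = phase_shares(NVTX_NS)
bwd_over_fwd = NVTX_NS["backward_pass"] / NVTX_NS["forward_pass"]

# Time inside profiled_step that none of the three phases covers.
phases_ns = sum(ns for name, ns in NVTX_NS.items() if name != "profiled_step")
unaccounted_ms = (NVTX_NS["profiled_step"] - phases_ns) / 1e6

for name, frac in shares.items():
    print(f"{name:15s} {frac:6.1%}")
print(f"backward / forward = {bwd_over_fwd:.2f}")
print(f"outside the three phases = {unaccounted_ms:.2f} ms")
```

The three phases leave roughly 3 ms of the step unaccounted for, which is close to the `loss_computation` range (about 3.1 ms) listed in the same table.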
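2.2 Bottleneck analysis: where does the time go?
The NVTX summary (`nvtx_sum`) breaks the time down by model component. For the Train test:
| Range | Total time (ns) | Relative to `profiled_step` | Notes |
|---|---|---|---|
| `profiled_step` | 2,758,088,295 | 100% | One full training step |
| `forward_pass` | 843,662,433 | 30.6% | Forward pass |
| `backward_pass` | 1,732,475,648 | 62.8% | Backward pass |
| `optimizer_step` | 178,760,308 | 6.5% | Parameter update |
| `attention_sublayer` | 1,896,613,679 | 68.8% | Attention, summed over all 128 instances (warmup included) |
| `ffn_sublayer` | 701,380,211 | 25.4% | FFN, summed over all 128 instances (warmup included) |
Key insights:
- Attention is the hot spot among the annotated components: the summed `attention_sublayer` time is close to 69% of `profiled_step`, against about 25% for `ffn_sublayer`. These sums are not literal shares of the profiled step, because the 128 instances (32 layers × 4 forward calls) include the warmup iterations and a single 859 ms first call, and NVTX ranges record host-side time. The per-call medians give a steadier comparison: 5.12 ms for attention versus 1.35 ms for the FFN, a factor of about 3.8. The attention range also covers the projections, RoPE, causal mask construction, and softmax, so the gap reflects many separate kernels as well as the O(n²) score computation at ctx=512.
- Per-layer times differ: the report lists every layer (`layer_0` to `layer_31`). `layer_0` has by far the largest total (903 ms), but that comes from one 881 ms call, while its median is 7.5 ms. The outlier is most likely first-call overhead during warmup (context setup, kernel loading, initial allocations) rather than anything specific to the first layer. Layers 24 to 31 also show medians near 26 ms against roughly 6 ms for most other layers; the report alone does not identify the cause.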
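How strongly a single slow call skews an NVTX total can be checked from the columns `nvtx_sum` already prints. A small sketch, using rows copied from the Train run; `skew_report` is a name invented here, and "median × instances" is only a rough stand-in for a steady-state total.

```python
# (total_ns, instances, median_ns, max_ns) copied from the Train run's nvtx_sum table.
RANGES = {
    "attention_sublayer": (1_896_613_679, 128, 5_119_042.5, 859_285_061),
    "ffn_sublayer": (701_380_211, 128, 1_354_732.5, 21_647_074),
    "layer_0": (903_231_175, 4, 7_524_818.5, 880_993_013),
}


def skew_report(total_ns, instances, median_ns, max_ns):
    """Compare the summed total with what the median call would predict."""
    return {
        "total_ms": total_ns / 1e6,
        "median_based_total_ms": median_ns * instances / 1e6,
        "max_share_of_total": max_ns / total_ns,
    }


report = {name: skew_report(*row) for name, row in RANGES.items()}
median_ratio = RANGES["attention_sublayer"][2] / RANGES["ffn_sublayer"][2]

for name, r in report.items():
    print(f"{name:20s} total {r['total_ms']:8.1f} ms | "
          f"median x n {r['median_based_total_ms']:7.1f} ms | "
          f"largest call {r['max_share_of_total']:5.1%} of total")
print(f"median attention / median ffn = {median_ratio:.2f}")
```

For `attention_sublayer` the largest single call makes up nearly half of the total, and for `layer_0` almost all of it, which is why medians are the safer basis for comparing components.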
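2.3 GPU kernel analysis: what runs underneath?
The CUDA GPU kernel summary (`cuda_gpu_kern_sum`) shows which low-level computations take the most time. For the Train test:
| GPU kernel | Share of kernel time | Notes |
|---|---|---|
| `ampere_sgemm_128x64_tn` | 27.3% | Single-precision general matrix multiply (SGEMM) |
| `void cutlass::Kernel2<...sgemm...>` | 20.3% | SGEMM from the CUTLASS library (SIMT variant) |
| `ampere_sgemm_128x32_sliced1x4_nn` | 13.5% | SGEMM with a different tile shape |
| `ampere_sgemm_128x128_nn` | 7.8% | SGEMM |
| `ampere_sgemm_128x64_nn` | 6.6% | SGEMM |
| `...elementwise_kernel...` and `...vectorized...` | ~9% | Elementwise operations (activation functions, residual additions, and so on) |
Key insights:
- Matrix multiplication (GEMM) dominates: the top entries are all `sgemm` (single-precision general matrix multiply) kernels. Summed over all eight `sgemm` kernels in the report, they account for about 84% of GPU kernel time. The `sgemm` and `simt` names indicate FP32 kernels running on the regular CUDA cores rather than on Tensor Cores, so performance here is tied to the GPU's FP32 throughput.
- Elementwise operations are not negligible: each one is fast, but the model launches a great many of them. Together the `elementwise_kernel` and `vectorized_elementwise_kernel` launches add up to roughly 9% of kernel time. These operations are usually limited by memory bandwidth rather than by compute.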
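The GEMM share can be tallied from the kernel table itself. The sketch below copies the eight `sgemm` rows from the Train run. Two assumptions are baked in: kernels are classified by the substring `sgemm` in their name, and the total kernel time (which nsys does not print) is estimated from the first row's rounded 27.3%.

```python
# Total time (ns) per GEMM kernel, copied from the Train run's cuda_gpu_kern_sum.
GEMM_KERNELS_NS = {
    "ampere_sgemm_128x64_tn": 3_002_482_417,
    "cutlass_80_simt_sgemm_128x32_8x5_nt_align1": 2_235_157_813,
    "ampere_sgemm_128x32_sliced1x4_nn": 1_487_779_391,
    "ampere_sgemm_128x128_nn": 854_356_834,
    "ampere_sgemm_128x64_nn": 731_264_017,
    "ampere_sgemm_128x64_nt": 727_675_106,
    "ampere_sgemm_128x128_nt": 122_711_755,
    "ampere_sgemm_128x128_tn": 83_350_734,
}
FIRST_ROW_PERCENT = 27.3  # "Time (%)" printed for ampere_sgemm_128x64_tn


def is_fp32_gemm(name):
    """Naming heuristic (an assumption): 'sgemm' marks an FP32 GEMM kernel."""
    return "sgemm" in name


gemm_ns = sum(ns for name, ns in GEMM_KERNELS_NS.items() if is_fp32_gemm(name))

# The report does not print a grand total, so estimate it from the first row.
total_ns_estimate = GEMM_KERNELS_NS["ampere_sgemm_128x64_tn"] / (FIRST_ROW_PERCENT / 100)
gemm_share = gemm_ns / total_ns_estimate

print(f"GEMM kernel time: {gemm_ns / 1e9:.2f} s")
print(f"estimated total : {total_ns_estimate / 1e9:.2f} s")
print(f"GEMM share      : {gemm_share:.0%}")
```

Because the 27.3% figure is rounded, the estimated share is only good to about one percentage point.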
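2.4 Memory operations
From the `cuda_gpu_mem_time_sum` and `cuda_gpu_mem_size_sum` reports:
- Host-to-Device copies are constant: in all three tests the Host-to-Device total is ~13.6 GB, transferred at the start of the run. This is the model parameters and the randomly generated input batch being copied from CPU memory to GPU memory. Once the copy is done, the computation needs no further bulk CPU-GPU traffic.
- Device-to-Device copies appear only when there is a backward pass: the Backward and Train tests show about 6.2 GB of Device-to-Device copies. These usually come from gradient accumulation, layout changes, or staging of intermediate results during backpropagation.
- Copies inside the training loop are cheap: the Device-to-Device copies take about 7.5 ms in total, which is negligible next to roughly 11 s of kernel time in the Train run. The Host-to-Device copy is not small in absolute terms (about 1.5 s in the Train run and 7.0 s in the Backward run for the same 13.6 GB), but it is a one-time loading cost. The steady-state training step is therefore compute-bound rather than transfer-bound.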
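The ~13.6 GB Host-to-Device figure can be cross-checked against the printed model configuration. The sketch below makes several assumptions the report does not state: FP32 weights, four attention projection matrices and three FFN matrices (SwiGLU-style) per layer, two norm gains per layer, no biases, and separate embedding and `lm_head` matrices. It also computes the effective copy bandwidth of the two runs.

```python
# Model configuration printed in the profiling log.
CONFIG = dict(d_model=2560, d_ff=10240, num_layers=32, vocab_size=10000)

# Host-to-Device totals copied from cuda_gpu_mem_size_sum / cuda_gpu_mem_time_sum.
H2D_MB = 13_627.402
H2D_NS = {"train": 1_509_378_340, "backward": 7_007_930_722}


def fp32_param_megabytes(d_model, d_ff, num_layers, vocab_size):
    """Parameter count and FP32 size under the assumed layout (see lead-in)."""
    per_layer = (4 * d_model * d_model    # Q, K, V, output projections
                 + 3 * d_model * d_ff     # three FFN matrices (assumed SwiGLU)
                 + 2 * d_model)           # two norm gains
    params = (num_layers * per_layer
              + 2 * vocab_size * d_model  # embedding + lm_head (assumed untied)
              + d_model)                  # final norm gain
    return params, params * 4 / 1e6       # 4 bytes per FP32 value, decimal MB


params, param_mb = fp32_param_megabytes(**CONFIG)
gb_per_s = {run: (H2D_MB / 1e3) / (ns / 1e9) for run, ns in H2D_NS.items()}

print(f"parameters        : {params / 1e9:.2f} B")
print(f"FP32 weights      : {param_mb:,.1f} MB (report: {H2D_MB:,.1f} MB)")
for run, rate in gb_per_s.items():
    print(f"H2D rate ({run:8s}): {rate:.2f} GB/s")
```

Under these assumptions the weights account for the copied volume to within a fraction of a megabyte. The same layout gives about 3.4 billion parameters, so the "2.7B" label should be read as a configuration name rather than an exact count.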
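3. Overall conclusions
Taken together, the analysis supports the following conclusions:
- Bottleneck: training this 2.7B-parameter model at context length 512 is a typical compute-bound workload. The main bottleneck is the GPU's matrix-multiply throughput, followed by memory bandwidth.
- Attention is key: `attention_sublayer` is the most expensive annotated component, with a summed NVTX time close to 70% of the training step. Optimizations aimed at attention, such as FlashAttention, should bring a significant speedup.
- The backward pass is expensive: it takes slightly more than twice as long as the forward pass and makes up most of the training cost.
- Room for optimization:
  - Enable mixed-precision training (AMP): training currently runs in single precision (FP32). Switching to FP16 or BF16 so that the GEMMs run on Tensor Cores can speed up the matrix-multiply kernels severalfold, which makes it the most effective single optimization.
  - Use FlashAttention: the report shows no sign of FlashAttention, and the plain `scaled_dot_product_attention` implementation is expensive. FlashAttention restructures memory access so the full score matrix is never materialized, which substantially reduces both the time and the memory footprint of the attention layer.
  - Operator fusion: the large number of `elementwise_kernel` launches points to many small, separate operations. Fusing them, either through a compiler such as torch.compile or with hand-written fused kernels such as those in TransformerEngine, reduces kernel-launch overhead and memory traffic.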
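Two back-of-the-envelope numbers help put these suggestions in context: the size of the attention score tensor that an unfused implementation materializes, and the matrix-multiply throughput implied by the forward pass. The sketch assumes FP32 activations, the same layer layout as the printed configuration with three FFN matrices per layer, and that the `forward_pass` NVTX range (a host-side measurement) covers the GPU work; none of this is confirmed by the report.

```python
# Shapes from the profiling log: batch, context, d_model, heads, layers, vocab, d_ff.
B, T, D, H, L, V, D_FF = 4, 512, 2560, 32, 32, 10000, 10240
FORWARD_NS = 843_662_433  # forward_pass range, Train run


def attention_score_bytes(batch, heads, seq, bytes_per_el=4):
    """Bytes of one materialized (batch, heads, seq, seq) score tensor."""
    return batch * heads * seq * seq * bytes_per_el


def forward_matmul_flops(batch, seq, d, heads, layers, vocab, d_ff):
    """Approximate forward matmul FLOPs (2 FLOPs per multiply-accumulate)."""
    tokens = batch * seq
    # Weight matmuls: 4 attention projections + 3 FFN matrices (assumed), plus lm_head.
    weight_params = layers * (4 * d * d + 3 * d * d_ff) + vocab * d
    dense = 2 * tokens * weight_params
    # Attention: Q @ K^T and scores @ V, each 2 * B * H * T * T * d_head FLOPs.
    d_head = d // heads
    attn = layers * 2 * (2 * batch * heads * seq * seq * d_head)
    return dense + attn


score_bytes = attention_score_bytes(B, H, T)
flops = forward_matmul_flops(B, T, D, H, L, V, D_FF)
tflops_per_s = flops / (FORWARD_NS / 1e9) / 1e12

print(f"score tensor per layer : {score_bytes / 1e6:.1f} MB")
print(f"forward matmul FLOPs   : {flops / 1e12:.2f} TFLOP")
print(f"implied throughput     : {tflops_per_s:.1f} TFLOP/s")
```

Each softmax, mask, and scaling step on the score tensor reads and writes a buffer of that size, which is the traffic FlashAttention avoids. Comparing the implied throughput with the FP32 peak of the GPU in use indicates how much headroom only lower-precision Tensor Core kernels can provide.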




