cs336-03: Performance Profiling

gogongxt, 2026-04-10
Test Report

The runs below use the 2.7B model with a context length of 512 as an example, profiling three modes in turn: forward, forward+backward, and train.
==========================================
Model: 2.7B, Context Length: 512
==========================================
Profiling: size=2.7B, context_length=512, type=forward
Collecting data...
Using device: cuda

Model configuration:
  Size: 2.7B
  d_model: 2560
  d_ff: 10240
  num_layers: 32
  num_heads: 32
  vocab_size: 10000
  context_length: 512
  batch_size: 4

Initializing model...
Creating random batch...

Running profiled step 1/1...

Profiling complete.
Generating '/tmp/nsys-report-eba2.qdstrm'
[1/8] [========================100%] 2.7B_ctx512_forward.nsys-rep
Processing 259647 events:
[2/8] [========================100%] 2.7B_ctx512_forward.sqlite
[3/8] Executing 'nvtx_sum' stats report

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)      StdDev (ns)    Style                    Range
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -------------  -------  ---------------------------------------
     27.4    3,509,732,403          1  3,509,732,403.0  3,509,732,403.0  3,509,732,403  3,509,732,403            0.0  PushPop  :warmup
     15.3    1,951,345,647        128     15,244,887.9      8,557,875.0      4,580,127    867,075,963   75,970,548.0  PushPop  :attention_sublayer
      7.3      932,033,009          4    233,008,252.3     17,638,398.5      9,106,138    887,650,074  436,477,614.6  PushPop  :layer_0
      6.6      844,667,951          1    844,667,951.0    844,667,951.0    844,667,951    844,667,951            0.0  PushPop  :profiled_step
      6.6      844,640,598          1    844,640,598.0    844,640,598.0    844,640,598    844,640,598            0.0  PushPop  :forward_pass
      5.1      652,927,547        128      5,100,996.5      3,539,440.0      1,209,669     20,497,491    5,289,676.2  PushPop  :ffn_sublayer
      4.2      532,425,153        128      4,159,571.5      2,589,074.5      1,357,137    132,770,103   11,622,481.6  PushPop  :scaled_dot_product_attention
      4.2      531,599,596        128      4,153,121.8      2,584,390.0      1,353,269    132,742,540   11,620,311.7  PushPop  :annotated_scaled_dot_product_attention
      3.0      386,992,835        128      3,023,381.5      2,160,590.0      1,661,082     92,582,272    8,013,505.7  PushPop  :rope encoding
      2.3      297,347,995        128      2,323,031.2        925,472.0        342,711    190,186,456   16,748,871.4  PushPop  :attention projections
      1.8      229,295,364        128      1,791,370.0      1,137,622.5        382,199     58,412,332    5,163,410.8  PushPop  :computing softmax
      1.3      161,988,135        128      1,265,532.3        825,050.5        336,479     24,145,062    2,225,677.1  PushPop  :computing attention scores
      0.7       93,249,368        128        728,510.7        258,756.5        187,597     38,442,354    3,456,221.0  PushPop  :causal mask construction
      0.7       89,274,806          4     22,318,701.5     23,529,256.5     16,101,895     26,114,398    4,805,676.2  PushPop  :layer_29
      0.7       88,709,681          4     22,177,420.3     24,795,781.5     13,006,635     26,111,483    6,204,590.2  PushPop  :layer_28
      0.7       85,584,381          4     21,396,095.3     23,124,333.5     12,993,029     26,342,685    6,290,173.3  PushPop  :layer_30
      0.7       83,281,904          4     20,820,476.0     22,308,517.0     12,331,664     26,333,206    6,704,505.7  PushPop  :layer_27
      0.6       79,761,791          4     19,940,447.8     20,496,168.0     12,646,931     26,122,524    7,181,660.7  PushPop  :layer_24
      0.6       79,581,900          4     19,895,475.0     20,983,472.5     11,494,575     26,120,380    7,366,885.9  PushPop  :layer_31
      0.6       76,702,512          4     19,175,628.0     19,460,410.5     11,657,183     26,124,508    8,036,266.3  PushPop  :layer_26
      0.6       76,223,576          4     19,055,894.0     19,060,257.5     12,022,599     26,080,462    8,111,200.6  PushPop  :layer_25
      0.5       62,961,418          4     15,740,354.5     15,874,480.0     13,117,802     18,094,656    2,471,137.6  PushPop  :layer_23
      0.4       57,511,056        128        449,305.1        380,003.5        336,599      2,851,142      244,297.6  PushPop  :final matmul
      0.4       51,890,826          4     12,972,706.5     12,655,352.5      5,844,914     20,735,207    8,179,394.2  PushPop  :layer_16
      0.4       50,645,806          4     12,661,451.5        604,773.5        456,133     48,980,126   24,212,577.8  PushPop  :token_embeddings
      0.4       50,584,668          4     12,646,167.0      8,811,446.5      5,912,356     27,049,419    9,955,475.3  PushPop  :layer_3
      0.4       49,668,133          4     12,417,033.3     10,926,908.0      5,941,683     21,872,634    7,861,423.4  PushPop  :layer_15
      0.4       49,584,559          4     12,396,139.8     11,964,785.0      5,999,837     19,655,152    7,368,607.1  PushPop  :layer_1
      0.4       46,970,531          4     11,742,632.8     11,442,507.5      5,916,804     18,168,712    6,742,340.5  PushPop  :layer_5
      0.4       46,459,330          4     11,614,832.5     11,104,466.0      5,851,025     18,399,373    6,605,588.5  PushPop  :layer_18
      0.4       45,400,676          4     11,350,169.0     10,986,790.0      5,820,215     17,606,881    6,364,266.3  PushPop  :layer_17
      0.4       45,390,007          4     11,347,501.8     10,124,156.5      6,058,071     19,083,623    6,395,641.8  PushPop  :layer_9
      0.4       45,289,429          4     11,322,357.3     10,261,495.5      5,941,001     18,825,437    6,402,332.2  PushPop  :layer_11
      0.3       42,348,161          4     10,587,040.3     10,304,710.0      5,959,809     15,778,932    5,311,080.9  PushPop  :layer_20
      0.3       42,093,984          4     10,523,496.0     10,286,389.5      5,883,048     15,638,157    5,220,993.6  PushPop  :layer_21
      0.3       41,921,005          4     10,480,251.3      9,575,562.5      5,981,601     16,788,279    5,375,139.7  PushPop  :layer_10
      0.3       41,690,223          4     10,422,555.8     10,410,256.5      5,912,977     14,956,733    5,187,652.1  PushPop  :layer_6
      0.3       40,567,013          4     10,141,753.3      9,421,661.0      5,993,133     15,730,558    4,846,130.0  PushPop  :layer_7
      0.3       40,392,251          4     10,098,062.8      9,329,130.0      5,898,528     15,835,463    4,945,688.4  PushPop  :layer_2
      0.3       39,925,348          4      9,981,337.0      8,979,661.0      5,895,222     16,070,804    4,944,447.8  PushPop  :layer_4
      0.3       39,694,847          4      9,923,711.8      9,852,875.0      5,869,793     14,119,304    4,605,114.7  PushPop  :layer_19
      0.3       39,655,318          4      9,913,829.5      9,747,448.5      5,896,935     14,263,486    4,581,969.7  PushPop  :layer_14
      0.3       39,558,050          4      9,889,512.5      9,439,206.0      5,856,877     14,822,761    4,638,748.7  PushPop  :layer_12
      0.3       39,198,423          4      9,799,605.8      9,470,652.0      6,008,974     14,248,145    4,402,646.5  PushPop  :layer_8
      0.3       38,472,094          4      9,618,023.5      9,104,621.0      5,950,370     14,312,482    4,282,023.8  PushPop  :layer_22
      0.3       37,093,394          4      9,273,348.5      8,965,593.5      6,000,618     13,161,589    3,763,297.4  PushPop  :layer_13
      0.2       22,541,470        128        176,105.2        162,453.0        146,497        425,122       36,386.6  PushPop  :output projection
      0.1        9,047,845          4      2,261,961.3      1,480,433.5        841,087      5,245,891    2,044,213.0  PushPop  :lm_head
      0.1        8,956,936          4      2,239,234.0      2,242,350.0        516,411      3,955,825    1,967,245.3  PushPop  :final_norm
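
The nanosecond totals in the `nvtx_sum` table above are easier to compare once converted to milliseconds. The following is a minimal sketch and not part of the original profiling script: `parse_nvtx_row` is a hypothetical helper that assumes rows are pasted verbatim from the table, with columns separated by two or more spaces and the range name prefixed by `:`.

```python
import re


def parse_nvtx_row(row: str) -> dict:
    """Parse one data row of an nsys `nvtx_sum` table (hypothetical helper)."""
    # Columns are separated by runs of 2+ spaces; a range name such as
    # "rope encoding" contains only single spaces, so it stays in one piece.
    cols = re.split(r"\s{2,}", row.strip())
    total_ns = int(cols[1].replace(",", ""))
    instances = int(cols[2].replace(",", ""))
    return {
        "range": cols[-1].lstrip(":"),
        "pct": float(cols[0]),
        "instances": instances,
        "total_ms": total_ns / 1e6,
        "avg_ms": total_ns / instances / 1e6,
    }


# Two rows copied from the forward-only table above.
rows = [
    "      6.6      844,640,598          1    844,640,598.0    844,640,598.0    844,640,598    844,640,598            0.0  PushPop  :forward_pass",
    "      3.0      386,992,835        128      3,023,381.5      2,160,590.0      1,661,082     92,582,272    8,013,505.7  PushPop  :rope encoding",
]

for row in rows:
    r = parse_nvtx_row(row)
    print(f"{r['range']:<16} total={r['total_ms']:.1f} ms  "
          f"avg={r['avg_ms']:.2f} ms  n={r['instances']}")
```

Note that NVTX ranges record host-side wall-clock time between push and pop, and nested ranges (for example `layer_0` and the `attention_sublayer` inside it) are counted separately, so the `Time (%)` column does not partition the run.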

[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)    StdDev (ns)            Name
 --------  ---------------  ---------  ------------  -------------  ---------  -----------  ------------  ----------------------
     74.9   10,727,933,987        118  90,914,694.8  100,118,737.0      4,048  214,439,550  31,076,672.9  poll
     13.6    1,941,101,621      4,750     408,653.0      111,448.0      1,002   10,931,032     536,883.6  read
      7.5    1,067,712,018      2,197     485,986.4      466,703.0      1,843   14,546,821     425,915.5  open64
      3.0      431,916,401      4,551      94,905.8       49,647.0      1,112   20,974,114     412,486.3  ioctl
      0.3       40,463,016     17,504       2,311.6        2,084.0      1,793      614,263       4,876.9  mmap64
      0.2       24,501,776        113     216,829.9        1,924.0      1,172   12,371,822   1,573,013.9  fopen
      0.1       21,419,709        938      22,835.5        4,238.0      2,274   14,808,663     483,320.9  socket
      0.1       19,886,714         96     207,153.3      111,298.0     77,010    5,230,030     577,880.0  pthread_create
      0.1       13,123,209      4,584       2,862.8        2,289.5      1,001       53,244       3,143.2  munmap
      0.1       10,120,083          2   5,060,041.5    5,060,041.5  5,059,546    5,060,537         700.7  nanosleep
      0.1        8,890,026         65     136,769.6       57,092.0     48,024    5,064,055     620,790.1  sleep
      0.0        5,969,677      2,585       2,309.4        1,663.0      1,001      208,719       5,365.4  write
      0.0        5,784,467        862       6,710.5        4,408.0      1,753       89,365       6,193.8  mmap
      0.0        1,896,913          3     632,304.3      852,759.0      4,389    1,039,765     551,770.8  fread
      0.0        1,363,358         91      14,982.0       15,731.0      1,262       59,928      10,432.9  fgets
      0.0        1,169,250          8     146,156.3       42,858.5      6,964      520,489     204,776.0  fopen64
      0.0          528,625         11      48,056.8       33,536.0     24,277      165,734      40,096.1  sem_timedwait
      0.0          314,546        195       1,613.1        1,222.0      1,002       11,162       1,526.1  pthread_cond_signal
      0.0          263,565         15      17,571.0        4,919.0      1,312      171,325      43,090.0  open
      0.0          161,136          1     161,136.0      161,136.0    161,136      161,136           0.0  pthread_cond_wait
      0.0           74,766          7      10,680.9       11,422.0      2,354       15,871       4,097.6  msgsnd
      0.0           56,841         11       5,167.4        5,240.0      1,002       14,378       4,056.6  pthread_cond_broadcast
      0.0           13,006          3       4,335.3        5,561.0      1,583        5,862       2,388.3  pipe2
      0.0            8,367          7       1,195.3        1,172.0      1,072        1,373         113.6  fcntl
      0.0            4,679          2       2,339.5        2,339.5      1,683        2,996         928.4  sigaction
      0.0            3,968          2       1,984.0        1,984.0      1,613        2,355         524.7  fwrite
      0.0            2,385          2       1,192.5        1,192.5      1,113        1,272         112.4  fflush
      0.0            2,134          1       2,134.0        2,134.0      2,134        2,134           0.0  openat64

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)                 Name
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  ------------  ---------------------------------
     27.3    3,642,255,174        292   12,473,476.6    6,686,692.0       10,150   77,013,961  16,971,943.3  cudaMemcpyAsync
     27.3    3,641,878,090        292   12,472,185.2    6,686,030.5        9,779   77,013,541  16,971,857.3  cudaMemcpyAsync
     12.3    1,644,181,348          4  411,045,337.0  445,189,418.5  292,173,892  461,628,619  80,323,915.4  cudaDeviceSynchronize
     12.3    1,644,175,465          4  411,043,866.3  445,187,835.0  292,171,828  461,627,967  80,324,401.9  cudaDeviceSynchronize
      5.6      743,243,840      7,329      101,411.4       10,411.0        5,751   44,795,302   1,077,893.2  cudaLaunchKernel
      5.5      740,041,084      7,329      100,974.4       10,049.0        5,531   44,793,449   1,077,873.4  cudaLaunchKernel
      4.7      633,610,941        929      682,035.5      551,860.0      310,177   15,563,932     652,309.7  cudaMalloc
      4.7      632,723,037        929      681,079.7      551,389.0      309,776   15,563,390     652,166.2  cudaMalloc
      0.1       13,703,250          4    3,425,812.5    4,239,210.0      799,906    4,424,924   1,752,800.4  cuLibraryLoadData
      0.1        7,095,752          1    7,095,752.0    7,095,752.0    7,095,752    7,095,752           0.0  cudaFree
      0.1        7,093,678          1    7,093,678.0    7,093,678.0    7,093,678    7,093,678           0.0  cudaFree
      0.0        4,429,468        292       15,169.4       16,457.5        2,194       62,843       7,951.2  cudaStreamSynchronize
      0.0        3,988,373        292       13,658.8       14,809.0        1,974       60,820       7,014.9  cudaStreamSynchronize
      0.0        2,094,160        928        2,256.6        1,483.0          481       22,935       1,797.5  cudaStreamIsCapturing_v10000
      0.0          962,354          2      481,177.0      481,177.0      458,968      503,386      31,408.3  cudaGetDeviceProperties_v2_v12000
      0.0          878,711         18       48,817.3          441.0          421      861,887     202,920.0  cudaEventCreateWithFlags
      0.0          215,729        784          275.2          171.0          101        5,350         254.1  cuGetProcAddress_v2
      0.0            8,106          3        2,702.0        1,503.0        1,122        5,481       2,414.2  cuInit
      0.0            7,997          2        3,998.5        3,998.5        1,674        6,323       3,287.3  cudaGetDriverEntryPoint_v11030
      0.0            4,440          4        1,110.0        1,158.0          641        1,483         348.8  cuLibraryGetKernel
      0.0            1,563          3          521.0          281.0          280        1,002         416.6  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     88.3    3,008,940,150        900  3,343,266.8  1,452,872.0  1,437,240  7,050,554  2,162,776.2  ampere_sgemm_128x64_tn
      1.9       63,096,270        128    492,939.6    485,293.5    481,309    594,524     27,582.1  ampere_sgemm_128x128_nn
      1.6       55,296,126      1,544     35,813.6     37,727.0     29,088     46,592      4,304.1  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      1.2       42,029,374        128    328,354.5    323,230.5    320,959    398,174     18,728.8  ampere_sgemm_128x128_tn
      1.1       36,128,955        256    141,128.7    138,704.0    133,376    149,663      6,500.2  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
      0.8       26,187,541        128    204,590.2    202,383.0    201,343    234,207      8,048.8  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.7       24,087,967        128    188,187.2    186,095.0    184,735    216,511      7,691.9  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.7       23,876,358        768     31,089.0     32,704.0     24,000     40,064      4,148.9  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.6       22,130,406        128    172,893.8    171,551.0    170,911    191,967      4,978.5  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.6       19,779,769        128    154,529.4    154,575.5    145,855    157,855      1,514.1  void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
      0.6       19,120,797        128    149,381.2    149,215.0    148,191    151,583        734.0  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<float, float, floa…
      0.5       15,628,649        768     20,349.8     14,400.0     13,824     33,856      8,413.1  void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, std::arr…
      0.4       14,559,957        128    113,749.7    112,543.5    111,647    130,528      4,432.9  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MaxOps<flo…
      0.3       11,429,057        128     89,289.5     88,703.0     86,815     99,455      2,424.3  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
      0.3       11,138,154        128     87,016.8     86,847.5     86,399     89,664        612.2  void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_kernel_cuda(at::TensorIt…
      0.2        5,567,999        260     21,415.4     21,312.0     20,032     25,600        574.4  void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
      0.1        4,828,940        260     18,572.8     18,432.0     17,376     21,632        809.7  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
      0.1        2,217,561        260      8,529.1      7,776.0      7,008     58,336      5,303.0  void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
      0.0          608,414        128      4,753.2      4,704.0      4,480      5,984        280.7  void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0          571,165        260      2,196.8      2,176.0      1,984      3,040        170.3  void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctorOnSelf_add<float>, st…
      0.0          522,784        260      2,010.7      1,984.0      1,920      2,464        119.6  void at::native::vectorized_elementwise_kernel<(int)4, at::native::rsqrt_kernel_cuda(at::TensorIter…
      0.0          520,443        256      2,033.0      2,016.0      1,760      2,720        169.9  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
      0.0          249,437        128      1,948.7      1,920.0      1,760      2,688        148.5  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, std::array<c…
      0.0            3,392          1      3,392.0      3,392.0      3,392      3,392          0.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<unsigned int, (int)4, void …

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)            Operation
 --------  ---------------  -----  ------------  -----------  --------  ----------  ------------  ----------------------------
    100.0    3,548,951,388    292  12,153,943.1  6,252,206.0     1,888  76,263,521  16,791,999.3  [CUDA memcpy Host-to-Device]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)           Operation
 ----------  -----  --------  --------  --------  --------  -----------  ----------------------------
 13,627.402    292    46.669    26.214     0.010   104.858       42.616  [CUDA memcpy Host-to-Device]
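
Dividing the total size in the `cuda_gpu_mem_size_sum` table by the total time in the `cuda_gpu_mem_time_sum` table gives a rough effective host-to-device bandwidth. The sketch below only does that arithmetic; it assumes that `MB` in the report means 10^6 bytes and that the 292 copies did not overlap in time. The function name `effective_bandwidth_gb_s` is illustrative.

```python
def effective_bandwidth_gb_s(total_mb: float, total_ns: int) -> float:
    """Effective bandwidth in GB/s, where 1 GB = 10^9 bytes (assumed units)."""
    total_bytes = total_mb * 1e6
    seconds = total_ns / 1e9
    return total_bytes / seconds / 1e9


# Totals copied from the two [CUDA memcpy Host-to-Device] rows above.
h2d_gb_s = effective_bandwidth_gb_s(13_627.402, 3_548_951_388)
print(f"effective H2D bandwidth: {h2d_gb_s:.2f} GB/s")

# Unit sanity check: one d_model x d_ff matrix and one d_model x d_model
# matrix, at 4 bytes per element, expressed in units of 10^6 bytes.
d_model, d_ff = 2560, 10240
ffn_weight_mb = d_model * d_ff * 4 / 1e6
attn_weight_mb = d_model * d_model * 4 / 1e6
print(f"d_model x d_ff weight:    {ffn_weight_mb:.3f} MB")
print(f"d_model x d_model weight: {attn_weight_mb:.3f} MB")
```

The two matrix sizes line up with the `Max (MB)` and `Med (MB)` columns of the size table, which is consistent with the copies being 4-byte-per-element weight tensors moved to the GPU.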

Generated:
    /nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_forward.nsys-rep
    /nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_forward.sqlite
Saved: nsys_results/2.7B_ctx512_forward.nsys-rep

Profiling: size=2.7B, context_length=512, type=backward
Collecting data...
Using device: cuda

Model configuration:
  Size: 2.7B
  d_model: 2560
  d_ff: 10240
  num_layers: 32
  num_heads: 32
  vocab_size: 10000
  context_length: 512
  batch_size: 4

Initializing model...
Creating random batch...

Running profiled step 1/1...

Profiling complete.
Generating '/tmp/nsys-report-643e.qdstrm'
[1/8] [========================100%] 2.7B_ctx512_backward.nsys-rep
Processing 319426 events:
[2/8] [========================100%] 2.7B_ctx512_backward.sqlite
[3/8] Executing 'nvtx_sum' stats report

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)      StdDev (ns)    Style                    Range
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -------------  -------  ---------------------------------------
     40.0    9,038,701,145          1  9,038,701,145.0  9,038,701,145.0  9,038,701,145  9,038,701,145            0.0  PushPop  :warmup
     11.4    2,580,601,956          1  2,580,601,956.0  2,580,601,956.0  2,580,601,956  2,580,601,956            0.0  PushPop  :profiled_step
      9.3    2,096,925,001        128     16,382,226.6      6,127,347.0      4,487,526    956,376,632   84,784,737.8  PushPop  :attention_sublayer
      7.7    1,734,457,535          1  1,734,457,535.0  1,734,457,535.0  1,734,457,535  1,734,457,535            0.0  PushPop  :backward_pass
      4.4      997,869,240          4    249,467,310.0      7,439,170.0      7,001,278    975,989,622  484,348,331.1  PushPop  :layer_0
      3.7      842,812,940          1    842,812,940.0    842,812,940.0    842,812,940    842,812,940            0.0  PushPop  :forward_pass
      3.3      739,320,395        128      5,775,940.6      1,336,301.5      1,194,849     62,729,175    8,455,523.5  PushPop  :ffn_sublayer
      2.8      627,142,985        128      4,899,554.6      1,822,227.5      1,329,303    144,953,360   15,513,585.1  PushPop  :scaled_dot_product_attention
      2.8      626,524,630        128      4,894,723.7      1,815,875.5      1,325,455    144,932,249   15,512,243.9  PushPop  :annotated_scaled_dot_product_attention
      1.6      362,249,576        128      2,830,074.8      1,741,765.5      1,635,813    100,411,102    8,720,479.9  PushPop  :rope encoding
      1.5      333,126,417        128      2,602,550.1        376,777.5        332,752    246,677,283   21,771,285.2  PushPop  :attention projections
      1.2      274,709,245        128      2,146,166.0        860,099.0        382,258     75,246,383    7,651,076.2  PushPop  :computing softmax
      1.2      271,745,113          4     67,936,278.3     26,154,763.0     21,527,378    197,908,209   86,675,407.8  PushPop  :layer_29
      1.0      227,741,444        128      1,779,230.0        370,736.0        307,953     46,345,999    6,204,920.8  PushPop  :computing attention scores
      0.6      142,421,504          4     35,605,376.0     26,077,867.5     25,932,182     64,333,587   19,152,286.3  PushPop  :layer_26
      0.6      136,041,482          4     34,010,370.5     28,645,670.0     26,109,819     52,640,323   12,643,561.3  PushPop  :layer_28
      0.6      126,512,656          4     31,628,164.0     26,205,958.0     26,082,647     48,018,093   10,927,151.8  PushPop  :layer_30
      0.5      114,962,665        128        898,145.8        203,293.0        184,571     46,472,848    5,076,259.9  PushPop  :causal mask construction
      0.4       95,830,304          4     23,957,576.0     26,105,897.0     17,486,156     26,132,354    4,314,313.3  PushPop  :layer_27
      0.4       88,848,393          4     22,212,098.3     25,692,813.0     11,293,521     26,169,246    7,288,336.9  PushPop  :layer_31
      0.4       88,736,626          4     22,184,156.5     24,073,262.5     14,438,429     26,151,672    5,503,309.8  PushPop  :layer_25
      0.4       79,379,235          4     19,844,808.8     22,947,263.0      7,351,133     26,133,576    8,849,435.0  PushPop  :layer_24
      0.3       76,297,051          4     19,074,262.8      7,165,048.5      5,875,173     56,091,781   24,704,712.8  PushPop  :layer_13
      0.3       59,512,107          4     14,878,026.8     14,860,931.0      7,436,058     22,354,187    6,468,240.0  PushPop  :layer_23
      0.2       55,741,434          4     13,935,358.5        448,884.0        249,548     54,594,118   27,106,006.7  PushPop  :token_embeddings
      0.2       50,340,479        128        393,285.0        368,376.5        329,164      2,626,663      203,675.4  PushPop  :final matmul
      0.2       36,142,801          4      9,035,700.3      8,277,425.5      5,727,053     13,860,897    3,973,407.0  PushPop  :layer_21
      0.2       35,761,577          4      8,940,394.3      6,869,224.0      5,853,060     16,170,069    4,910,824.5  PushPop  :layer_2
      0.2       34,757,470          4      8,689,367.5      7,636,290.5      5,780,838     13,704,051    3,736,557.0  PushPop  :layer_18
      0.2       34,523,872          4      8,630,968.0      8,070,350.0      5,800,627     12,582,545    3,372,567.1  PushPop  :layer_12
      0.2       34,080,434          4      8,520,108.5      6,754,163.5      5,854,342     14,717,765    4,205,817.4  PushPop  :layer_19
      0.1       33,513,083          4      8,378,270.8      6,998,136.0      5,849,813     13,666,998    3,656,560.1  PushPop  :layer_6
      0.1       32,801,401          4      8,200,350.3      6,888,301.5      5,932,104     13,092,694    3,297,104.7  PushPop  :layer_8
      0.1       32,565,199          4      8,141,299.8      7,240,857.0      5,838,080     12,245,405    2,816,447.1  PushPop  :layer_7
      0.1       31,837,704          4      7,959,426.0      6,523,703.0      5,848,630     12,941,668    3,381,872.5  PushPop  :layer_14
      0.1       31,498,641          4      7,874,660.3      6,475,794.0      5,838,922     12,708,131    3,269,957.7  PushPop  :layer_3
      0.1       31,348,507          4      7,837,126.8      6,649,840.0      5,891,345     12,157,482    2,950,555.2  PushPop  :layer_9
      0.1       31,231,769          4      7,807,942.3      6,621,113.5      5,791,900     12,197,642    3,017,970.3  PushPop  :layer_20
      0.1       31,131,060          4      7,782,765.0      6,795,294.0      5,920,692     11,619,780    2,616,144.6  PushPop  :layer_17
      0.1       31,050,682          4      7,762,670.5      6,469,591.5      5,834,423     12,277,076    3,065,818.9  PushPop  :layer_4
      0.1       30,847,456          4      7,711,864.0      6,501,799.5      5,850,986     11,992,871    2,905,318.3  PushPop  :layer_5
      0.1       30,766,828          4      7,691,707.0      6,546,382.0      5,847,429     11,826,635    2,826,192.3  PushPop  :layer_11
      0.1       30,737,009          4      7,684,252.3      6,848,839.0      5,787,211     11,252,120    2,431,495.3  PushPop  :layer_16
      0.1       30,503,764          4      7,625,941.0      6,613,569.0      5,754,397     11,522,229    2,716,611.7  PushPop  :layer_22
      0.1       30,480,818          4      7,620,204.5      6,521,132.5      5,763,384     11,675,169    2,760,575.6  PushPop  :layer_15
      0.1       30,471,028          4      7,617,757.0      6,584,306.5      5,849,422     11,452,993    2,624,857.7  PushPop  :layer_10
      0.1       29,872,717          4      7,468,179.3      5,907,241.0      5,856,306     12,201,929    3,155,980.2  PushPop  :layer_1
      0.1       20,584,393        128        160,815.6        156,551.5        144,513        249,629       16,515.2  PushPop  :output projection
      0.0       11,050,406          4      2,762,601.5        463,026.5        152,739      9,971,614    4,814,845.7  PushPop  :lm_head
      0.0       11,005,837          4      2,751,459.3      3,281,549.5        495,019      3,947,719    1,612,596.7  PushPop  :final_norm
      0.0        3,261,275          1      3,261,275.0      3,261,275.0      3,261,275      3,261,275            0.0  PushPop  :loss_computation
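
The `forward_pass`, `backward_pass`, and `loss_computation` totals in the table above can be cross-checked against `profiled_step`. The sketch below uses only numbers copied from the table and assumes the three ranges run back to back inside `profiled_step`, which is a guess about the profiling script and not something the log states. Variable names are illustrative.

```python
# Totals in nanoseconds, copied from the forward+backward nvtx_sum table.
forward_ns = 842_812_940
backward_ns = 1_734_457_535
loss_ns = 3_261_275
profiled_step_ns = 2_580_601_956

# How much more host time the backward pass takes than the forward pass.
backward_to_forward = backward_ns / forward_ns

# Time inside profiled_step not covered by the three child ranges.
unaccounted_ns = profiled_step_ns - (forward_ns + loss_ns + backward_ns)

print(f"backward / forward: {backward_to_forward:.2f}x")
print(f"unaccounted inside profiled_step: {unaccounted_ns / 1e3:.1f} us")
```

A ratio close to 2 matches the usual rule of thumb that the backward pass costs about twice the forward pass, since it computes gradients with respect to both activations and weights.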

[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls     Avg (ns)        Med (ns)     Min (ns)      Max (ns)       StdDev (ns)             Name
 --------  ---------------  ---------  ---------------  -------------  ---------  --------------  ---------------  ----------------------
     47.0   21,952,963,197        230     95,447,666.1  100,118,655.0      3,066     225,544,678     22,928,907.3  poll
     43.6   20,373,469,766         13  1,567,189,982.0  972,600,948.0    133,942  10,186,269,697  2,688,018,597.0  pthread_cond_wait
      4.2    1,944,993,312      4,133        470,600.8      184,401.0      1,001      21,478,101        750,696.6  read
      2.6    1,214,626,381      3,715        326,951.9       47,663.0      1,092      55,529,039      2,437,572.0  ioctl
      2.3    1,058,507,220      2,197        481,796.6      464,700.0      2,815      11,177,123        348,200.1  open64
      0.1       40,133,388     17,511          2,291.9        1,964.0      1,804         761,171          6,002.0  mmap64
      0.0       20,772,645      4,650          4,467.2        2,655.0      1,001       5,968,306         87,551.7  munmap
      0.0       16,188,599        730         22,176.2        3,957.5      1,753       9,800,718        362,491.9  socket
      0.0       15,074,102         98        153,817.4      115,446.0     67,833       2,227,251        222,921.1  pthread_create
      0.0       12,955,655         15        863,710.3        5,010.0      1,302      12,870,659      3,321,626.7  open
      0.0       12,783,328        113        113,126.8        1,774.0      1,092      11,896,882      1,118,648.7  fopen
      0.0       10,122,870          2      5,061,435.0    5,061,435.0  5,061,039       5,061,831            560.0  nanosleep
      0.0        7,425,546        665         11,166.2        4,238.0      1,593       2,608,507        102,132.6  mmap
      0.0        4,026,846      1,707          2,359.0        1,583.0      1,001          26,492          2,137.1  write
      0.0        3,638,160         65         55,971.7       56,120.0     39,176          69,045          4,399.7  sleep
      0.0        1,932,443          3        644,147.7      905,352.0      3,416       1,023,675        558,034.8  fread
      0.0        1,246,452          8        155,806.5       43,565.5      4,800         591,558        235,234.9  fopen64
      0.0        1,134,247         74         15,327.7       12,925.0      1,182          58,615          9,676.5  fgets
      0.0          457,155          2        228,577.5      228,577.5     51,030         406,125        251,090.1  pthread_cond_timedwait
      0.0          456,465         10         45,646.5       34,322.0     17,154         109,525         34,022.9  sem_timedwait
      0.0          355,771        146          2,436.8        1,172.5      1,002          11,863          2,576.0  pthread_cond_signal
      0.0          235,762          1        235,762.0      235,762.0    235,762         235,762              0.0  pthread_mutex_lock
      0.0          214,280          4         53,570.0        7,244.0      1,123         198,669         96,791.1  pthread_cond_broadcast
      0.0           76,551          7         10,935.9        8,096.0      4,058          21,783          6,338.3  msgsnd
      0.0            9,418          3          3,139.3        2,625.0      1,443           5,350          2,003.6  pipe2
      0.0            6,452          4          1,613.0        1,493.0      1,312           2,154            383.4  fflush
      0.0            5,440          2          2,720.0        2,720.0      2,284           3,156            616.6  sigaction
      0.0            5,150          4          1,287.5        1,247.5      1,052           1,603            229.7  fcntl
      0.0            4,298          2          2,149.0        2,149.0      1,422           2,876          1,028.1  fwrite
      0.0            2,324          1          2,324.0        2,324.0      2,324           2,324              0.0  openat64
      0.0            1,052          1          1,052.0        1,052.0      1,052           1,052              0.0  listen

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)     Min (ns)   Max (ns)     StdDev (ns)                 Name
 --------  ---------------  ---------  -------------  -------------  --------  -----------  -------------  ---------------------------------
     21.4    7,247,233,645        560   12,941,488.7    1,325,660.5     9,599  119,387,515   24,681,326.6  cudaMemcpyAsync
     21.4    7,246,813,399        560   12,940,738.2    1,325,399.5     9,077  119,386,754   24,681,286.0  cudaMemcpyAsync
     17.4    5,898,486,517     19,695      299,491.6        9,428.0     5,310  495,334,972    3,778,261.6  cudaLaunchKernel
     17.4    5,879,592,231     19,695      298,532.2        9,067.0     5,100  495,334,451    3,777,467.9  cudaLaunchKernel
      5.8    1,954,680,414          5  390,936,082.8  494,370,574.0    45,830  494,620,774  218,750,348.8  cudaDeviceSynchronize
      5.8    1,954,675,694          5  390,935,138.8  494,369,412.0    44,447  494,619,652  218,750,577.8  cudaDeviceSynchronize
      3.9    1,329,778,478        720    1,846,914.6      466,047.5   117,400   46,091,250    5,374,718.0  cudaMalloc
      3.9    1,329,237,922        720    1,846,163.8      465,656.5   117,119   46,090,329    5,374,621.9  cudaMalloc
      2.6      871,466,771      1,544      564,421.5        8,917.0     7,485    5,688,998    1,621,205.1  cuLaunchKernel
      0.2       64,510,122        260      248,115.9        9,649.0     6,622    5,681,183      708,276.6  cudaMemsetAsync
      0.2       64,437,317        260      247,835.8        8,892.5     6,432    5,680,892      708,289.4  cudaMemsetAsync
      0.0       12,883,425          4    3,220,856.3    3,849,021.0   736,913    4,448,470    1,682,105.8  cuLibraryLoadData
      0.0        7,134,899          2    3,567,449.5    3,567,449.5   130,686    7,004,213    4,860,317.6  cudaFree
      0.0        7,133,016          2    3,566,508.0    3,566,508.0   130,245    7,002,771    4,859,609.7  cudaFree
      0.0        4,369,272        292       14,963.3       16,242.0     2,094       65,689        7,732.7  cudaStreamSynchronize
      0.0        3,879,440        292       13,285.8       14,408.5     1,894       56,441        6,777.7  cudaStreamSynchronize
      0.0        1,702,864        716        2,378.3        1,433.0       501       26,021        2,049.2  cudaStreamIsCapturing_v10000
      0.0        1,516,692        772        1,964.6          772.0       611      885,995       31,860.1  cuKernelGetFunction
      0.0          961,944         36       26,720.7          421.0       410      927,456      154,417.6  cudaEventCreateWithFlags
      0.0          941,753          2      470,876.5      470,876.5   450,201      491,552       29,239.6  cudaGetDeviceProperties_v2_v12000
      0.0          253,757        784          323.7          281.0       110        6,513          298.7  cuGetProcAddress_v2
      0.0           12,023          3        4,007.7        2,705.0     1,633        7,685        3,229.5  cuInit
      0.0            6,813          5        1,362.6        1,563.0       832        1,723          419.8  cuLibraryGetKernel
      0.0            2,175          2        1,087.5        1,087.5       682        1,493          573.5  cudaGetDriverEntryPoint_v11030
      0.0            1,712          3          570.7          170.0       130        1,412          728.9  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     29.0    2,980,566,485        900  3,311,740.5  1,448,649.0  1,431,800  7,050,586  2,135,138.7  ampere_sgemm_128x64_tn
     21.8    2,235,749,205        772  2,896,048.2  1,447,000.5  1,444,857  5,774,017  2,035,002.0  void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1>(T1::Params)
     14.5    1,487,949,867        260  5,722,884.1  5,725,188.0  5,599,655  5,739,617     15,506.8  ampere_sgemm_128x32_sliced1x4_nn
      8.3      853,988,820        768  1,111,964.6  1,423,960.5    479,774  1,438,456    442,264.6  ampere_sgemm_128x128_nn
      7.1      731,375,952        128  5,713,874.6  5,712,915.5  5,708,065  5,761,569      6,157.3  ampere_sgemm_128x64_nn
      7.1      727,644,706        128  5,684,724.3  5,683,878.5  5,682,278  5,743,521      6,273.2  ampere_sgemm_128x64_nt
      1.9      197,378,632      2,064     95,629.2     41,008.0      1,919    234,687     74,285.4  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
      1.7      179,205,434        931    192,487.0    171,327.0     86,752    372,990    110,611.7  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      1.2      122,742,918        256    479,464.5    479,357.0    478,877    484,510        537.1  ampere_sgemm_128x128_nt
      1.0      103,020,714      2,634     39,111.9     30,400.0      2,176    237,246     52,548.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, std::arr…
      1.0      102,337,256        512    199,877.5    204,191.0    181,311    233,662      9,006.4  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.9       93,117,311      1,769     52,638.4     34,016.0     24,096    359,678     66,545.4  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.9       91,351,725      3,088     29,582.8     30,624.0     11,712     46,592      8,070.4  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.8       83,057,605        256    324,443.8    323,326.0    319,870    397,982      9,076.8  ampere_sgemm_128x128_tn
      0.5       47,711,927        256    186,374.7    185,919.0    184,575    216,543      3,737.4  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.4       42,196,216        644     65,522.1     88,191.5     25,343     97,759     32,415.8  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
      0.4       39,022,817        256    152,432.9    152,415.0    148,127    159,423      3,297.3  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<float, float, floa…
      0.4       38,268,484        256    149,486.3    149,967.0    145,279    154,495      2,994.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::neg_kernel_cuda(at::TensorIterat…
      0.2       19,788,411        128    154,597.0    154,543.5    145,343    157,951      1,412.5  void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
      0.2       18,542,500        128    144,863.3    144,911.0    142,463    145,951        483.2  void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_backward_kernel_cuda(at:…
      0.1       14,484,187        128    113,157.7    112,543.5    111,903    129,887      2,961.6  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MaxOps<flo…
      0.1       11,130,151        128     86,954.3     86,879.0     86,239     89,087        453.3  void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_kernel_cuda(at::TensorIt…
      0.1       10,504,807        408     25,747.1      2,016.0      1,728     76,320     31,970.6  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, std::array<c…
      0.1        5,542,934        260     21,319.0     21,312.0     19,903     25,376        510.8  void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
      0.1        5,499,781        128     42,967.0     42,960.0     42,528     43,711        209.0  void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>,…
      0.1        5,171,306        260     19,889.6     19,872.0     18,848     20,512        201.6  void at::native::reduce_kernel<(int)128, (int)4, at::native::ReduceOp<float, at::native::func_wrapp…
      0.0        5,097,957        520      9,803.8      9,328.0      1,920     20,480      7,428.6  void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, floa…
      0.0        4,804,234        260     18,477.8     18,415.5     17,280     22,368        628.2  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
      0.0        4,712,746        260     18,125.9     18,112.0     17,984     18,368         64.7  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0        4,690,950        256     18,324.0     18,080.0     17,344     20,128        832.3  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0        3,374,223        128     26,361.1     26,336.0     25,728     27,520        336.4  void at::native::<unnamed>::CatArrayBatchedCopy_aligned16_contig<at::native::<unnamed>::OpaqueType<…
      0.0        2,255,125        260      8,673.6      7,840.0      7,168     58,816      5,454.5  void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
      0.0        1,481,783        128     11,576.4     11,584.0     10,784     12,800        362.5  void at::native::_scatter_gather_elementwise_kernel<(int)128, (int)4, void at::native::_cuda_scatte…
      0.0          720,348          4    180,087.0    179,839.0    179,743    180,927        564.3  void at::native::<unnamed>::cunn_SoftMaxBackward<(int)4, float, float, float, at::native::<unnamed>…
      0.0          677,307        260      2,605.0      2,592.0      2,272      3,040        125.2  void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
      0.0          608,670        128      4,755.2      4,720.0      4,544      5,824        190.5  void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0          565,404        260      2,174.6      2,159.5      1,984      3,136        150.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctorOnSelf_add<float>, st…
      0.0          536,861        260      2,064.8      2,048.0      1,792      2,912        187.4  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
      0.0          517,631        260      1,990.9      1,984.0      1,920      2,464         79.7  void at::native::vectorized_elementwise_kernel<(int)4, at::native::rsqrt_kernel_cuda(at::TensorIter…
      0.0          440,733          4    110,183.3    110,447.5    108,351    111,487      1,475.3  void at::native::<unnamed>::cunn_SoftMaxForwardSmem<(int)4, float, float, float, at::native::<unnam…
      0.0          275,103          4     68,775.8     68,815.5     68,160     69,312        622.1  void at::native::<unnamed>::nll_loss_forward_reduce_cuda_kernel_2d<float, float, long>(T1 *, T1 *, …
      0.0          162,847          4     40,711.8     40,768.0     40,256     41,055        348.6  void <unnamed>::indexing_backward_kernel<float, (int)4>(const long *, const long *, const T1 *, T1 …
      0.0          157,791          4     39,447.8     39,504.0     39,167     39,616        205.1  void at::native::<unnamed>::nll_loss_backward_reduce_cuda_kernel_2d<float, long>(T1 *, const T1 *, …
      0.0           42,048          4     10,512.0     10,496.0     10,432     10,624         84.7  void at_cuda_detail::cub::DeviceRadixSortSingleTileKernel<at_cuda_detail::cub::DeviceRadixSortPolic…
      0.0           16,064          4      4,016.0      4,016.0      3,712      4,320        251.3  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0           12,736          4      3,184.0      3,232.0      2,944      3,328        183.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
      0.0           10,368          4      2,592.0      2,592.0      2,592      2,592          0.0  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
      0.0            9,600          4      2,400.0      2,384.0      2,336      2,496         69.1  void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<long, long, long, …
      0.0            3,488          1      3,488.0      3,488.0      3,488      3,488          0.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<unsigned int, (int)4, void …

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)             Operation
 --------  ---------------  -----  ------------  ------------  --------  -----------  ------------  ------------------------------
     99.9    7,007,930,722    292  23,999,762.7  15,852,378.5     1,888  118,282,014  29,602,623.7  [CUDA memcpy Host-to-Device]
      0.1        7,517,185    268      28,049.2      25,440.0    24,896      119,295      13,996.3  [CUDA memcpy Device-to-Device]
      0.0          548,605    260       2,110.0       2,112.0     1,024        2,528          80.4  [CUDA memset]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)            Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------
 13,627.402    292    46.669    26.214     0.010   104.858       42.616  [CUDA memcpy Host-to-Device]
  6,189.235    268    23.094    20.972    20.972   102.400       12.254  [CUDA memcpy Device-to-Device]
      0.021    260     0.000     0.000     0.000     0.000        0.000  [CUDA memset]

Generated:
    /nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_backward.nsys-rep
    /nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_backward.sqlite
Saved: nsys_results/2.7B_ctx512_backward.nsys-rep
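
As a rough sanity check on the two memory tables of the forward+backward run above, the totals can be turned into an average transfer rate. The sketch below only divides the reported sizes by the reported times; the dictionary layout and the function name are illustrative, not part of the profiling script. The MB figures appear to be decimal (the 104.858 MB maximum matches 2560 × 10240 × 4 bytes, one float32 `d_model × d_ff` matrix), so the conversion assumes 1 GB = 1000 MB.

```python
# Totals copied from the cuda_gpu_mem_time_sum and cuda_gpu_mem_size_sum
# tables of the forward+backward run above.
transfers = {
    "Host-to-Device": {"total_mb": 13_627.402, "total_ns": 7_007_930_722},
    "Device-to-Device": {"total_mb": 6_189.235, "total_ns": 7_517_185},
}


def effective_bandwidth_gb_s(total_mb: float, total_ns: int) -> float:
    """Average throughput in GB/s, assuming decimal units (1 GB = 1000 MB)."""
    seconds = total_ns / 1e9
    return (total_mb / 1000) / seconds


bandwidth = {
    name: effective_bandwidth_gb_s(**row) for name, row in transfers.items()
}

for name, gb_s in bandwidth.items():
    print(f"{name}: {gb_s:.2f} GB/s")
```

Host-to-device copies come out at a little under 2 GB/s on average, while device-to-device copies run at several hundred GB/s. That gap is why host-to-device traffic accounts for 99.9% of the memory-operation time even though it moves only about twice as many bytes. The log alone does not say why the host-to-device rate is this low; copying from pageable rather than pinned host memory is one possible explanation worth checking.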

Profiling: size=2.7B, context_length=512, type=train
Collecting data...
Using device: cuda

Model configuration:
  Size: 2.7B
  d_model: 2560
  d_ff: 10240
  num_layers: 32
  num_heads: 32
  vocab_size: 10000
  context_length: 512
  batch_size: 4

Initializing model...
Creating random batch...

Running profiled step 1/1...

Profiling complete.
Generating '/tmp/nsys-report-477e.qdstrm'
[1/8] [========================100%] 2.7B_ctx512_train.nsys-rep
Processing 339233 events:
[2/8] [========================100%] 2.7B_ctx512_train.sqlite
[3/8] Executing 'nvtx_sum' stats report

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)      StdDev (ns)    Style                    Range
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -------------  -------  ---------------------------------------
     43.2    9,866,325,771          1  9,866,325,771.0  9,866,325,771.0  9,866,325,771  9,866,325,771            0.0  PushPop  :warmup
     12.1    2,758,088,295          1  2,758,088,295.0  2,758,088,295.0  2,758,088,295  2,758,088,295            0.0  PushPop  :profiled_step
      8.3    1,896,613,679        128     14,817,294.4      5,119,042.5      4,546,131    859,285,061   75,684,546.2  PushPop  :attention_sublayer
      7.6    1,732,475,648          1  1,732,475,648.0  1,732,475,648.0  1,732,475,648  1,732,475,648            0.0  PushPop  :backward_pass
      4.0      903,231,175          4    225,807,793.8      7,524,818.5      7,188,525    880,993,013  436,790,217.6  PushPop  :layer_0
      3.7      843,662,433          1    843,662,433.0    843,662,433.0    843,662,433    843,662,433            0.0  PushPop  :forward_pass
      3.1      701,380,211        128      5,479,532.9      1,354,732.5      1,204,849     21,647,074    6,535,892.6  PushPop  :ffn_sublayer
      2.1      489,582,879        128      3,824,866.2      1,453,305.0      1,348,942    119,793,927   11,735,299.0  PushPop  :scaled_dot_product_attention
      2.1      488,840,469        128      3,819,066.2      1,448,836.5      1,345,043    119,781,121   11,734,047.3  PushPop  :annotated_scaled_dot_product_attention
      1.6      356,320,108        128      2,783,750.8      1,740,838.0      1,634,820    100,288,567    8,713,326.4  PushPop  :rope encoding
      1.4      329,760,873        128      2,576,256.8        371,437.5        339,595    183,266,201   16,763,493.9  PushPop  :attention projections
      0.9      195,493,826        128      1,527,295.5        379,047.5        340,877     44,298,402    5,474,557.1  PushPop  :computing attention scores
      0.8      178,760,308          1    178,760,308.0    178,760,308.0    178,760,308    178,760,308            0.0  PushPop  :optimizer_step
      0.7      163,841,492        128      1,280,011.7        424,596.5        386,888     50,208,544    4,438,987.6  PushPop  :computing softmax
      0.6      140,356,919          4     35,089,229.8     26,119,665.5     26,085,203     62,032,385   17,962,112.7  PushPop  :layer_27
      0.5      105,233,309          4     26,308,327.3     26,052,554.0     25,826,967     27,301,234      671,355.4  PushPop  :layer_26
      0.4       96,421,866          4     24,105,466.5     26,089,967.0     18,037,847     26,204,085    4,045,466.9  PushPop  :layer_28
      0.4       95,504,771          4     23,876,192.8     26,059,161.5     17,172,503     26,213,945    4,469,902.8  PushPop  :layer_29
      0.4       92,776,559          4     23,194,139.8     26,062,102.5     14,544,517     26,107,837    5,766,560.0  PushPop  :layer_30
      0.4       92,234,238          4     23,058,559.5     26,032,685.0     14,091,631     26,077,237    5,978,031.5  PushPop  :layer_31
      0.4       90,672,651          4     22,668,162.8     26,121,999.5     12,256,448     26,172,204    6,941,258.7  PushPop  :layer_24
      0.4       90,165,106          4     22,541,276.5     26,143,963.0     11,440,600     26,436,580    7,401,742.5  PushPop  :layer_25
      0.3       78,305,758        128        611,763.7        197,978.0        186,926     40,472,612    3,562,877.8  PushPop  :causal mask construction
      0.3       75,908,733          4     18,977,183.3      6,055,561.5      6,021,048     57,776,562   25,866,268.5  PushPop  :layer_13
      0.3       70,968,732          4     17,742,183.0      6,057,921.5      5,969,237     52,883,652   23,427,731.1  PushPop  :layer_8
      0.3       69,634,238          4     17,408,559.5      5,969,758.0      5,930,191     51,764,531   22,903,993.2  PushPop  :layer_17
      0.3       66,908,110          4     16,727,027.5     17,452,280.0     12,234,965     19,768,585    3,364,546.9  PushPop  :layer_23
      0.2       52,802,349        128        412,518.4        377,082.5        336,238      2,132,946      168,418.0  PushPop  :final matmul
      0.2       49,097,929          4     12,274,482.3        419,506.0        328,904     47,930,013   23,770,392.6  PushPop  :token_embeddings
      0.2       38,027,865          4      9,506,966.3      6,234,541.5      5,954,288     19,604,494    6,736,513.5  PushPop  :layer_15
      0.2       34,966,762          4      8,741,690.5      6,427,219.0      5,912,947     16,199,377    4,992,101.7  PushPop  :layer_5
      0.2       34,304,385          4      8,576,096.3      6,080,630.5      5,899,981     16,243,143    5,113,632.9  PushPop  :layer_4
      0.1       34,070,137          4      8,517,534.3      5,947,304.5      5,883,239     16,292,289    5,183,374.5  PushPop  :layer_14
      0.1       32,848,815          4      8,212,203.8      5,941,763.5      5,902,877     15,062,411    4,566,877.7  PushPop  :layer_20
      0.1       32,502,055          4      8,125,513.8      5,998,494.0      5,826,338     14,678,729    4,370,046.7  PushPop  :layer_19
      0.1       32,417,751          4      8,104,437.8      6,533,502.0      5,935,090     13,415,657    3,560,959.5  PushPop  :layer_6
      0.1       32,416,460          4      8,104,115.0      6,506,584.5      6,014,687     13,388,604    3,550,982.2  PushPop  :layer_1
      0.1       31,808,618          4      7,952,154.5      5,921,714.0      5,891,485     14,073,705    4,081,060.2  PushPop  :layer_16
      0.1       31,789,241          4      7,947,310.3      6,009,927.0      5,963,225     13,806,162    3,906,027.9  PushPop  :layer_7
      0.1       31,706,038          4      7,926,509.5      6,383,368.0      5,973,456     12,965,846    3,370,600.0  PushPop  :layer_2
      0.1       31,372,556          4      7,843,139.0      5,990,870.0      5,851,477     13,539,339    3,798,885.1  PushPop  :layer_9
      0.1       31,361,794          4      7,840,448.5      5,970,555.0      5,951,453     13,469,231    3,752,534.3  PushPop  :layer_18
      0.1       30,846,866          4      7,711,716.5      5,997,508.0      5,889,451     12,962,399    3,501,390.9  PushPop  :layer_11
      0.1       30,595,954          4      7,648,988.5      6,079,518.0      5,939,329     12,497,589    3,234,363.9  PushPop  :layer_21
      0.1       30,432,163          4      7,608,040.8      6,164,018.5      5,884,161     12,219,965    3,079,599.0  PushPop  :layer_12
      0.1       30,313,833          4      7,578,458.3      6,130,974.5      5,885,794     12,166,090    3,065,597.2  PushPop  :layer_3
      0.1       29,851,817          4      7,462,954.3      6,027,401.0      5,859,061     11,937,954    2,984,435.2  PushPop  :layer_22
      0.1       29,439,851          4      7,359,962.8      5,987,648.5      5,851,296     11,613,258    2,836,778.7  PushPop  :layer_10
      0.1       21,192,761        128        165,568.4        158,084.5        146,446        260,840       20,724.3  PushPop  :output projection
      0.1       12,491,567          4      3,122,891.8      3,934,268.0        608,000      4,015,031    1,677,064.7  PushPop  :final_norm
      0.0        3,113,887          1      3,113,887.0      3,113,887.0      3,113,887      3,113,887            0.0  PushPop  :loss_computation
      0.0        1,522,931          4        380,732.8        167,282.5        158,440      1,029,926      432,842.9  PushPop  :lm_head
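
The single-instance NVTX ranges in the train table above can be cross-checked against each other. The sketch below assumes that `forward_pass`, `loss_computation`, `backward_pass` and `optimizer_step` are sequential, non-overlapping children of `profiled_step`; the table does not state the nesting, so that is an assumption. The variable names are illustrative.

```python
# Total Time (ns) of the single-instance NVTX ranges in the train run above.
nvtx_ns = {
    "profiled_step": 2_758_088_295,
    "forward_pass": 843_662_433,
    "loss_computation": 3_113_887,
    "backward_pass": 1_732_475_648,
    "optimizer_step": 178_760_308,
}

# Assumption: these four ranges run one after another inside profiled_step.
phases = ["forward_pass", "loss_computation", "backward_pass", "optimizer_step"]

phase_total = sum(nvtx_ns[p] for p in phases)
unaccounted_ns = nvtx_ns["profiled_step"] - phase_total
shares = {p: nvtx_ns[p] / nvtx_ns["profiled_step"] for p in phases}
backward_to_forward = nvtx_ns["backward_pass"] / nvtx_ns["forward_pass"]

for phase in phases:
    print(f"{phase:>18}: {shares[phase]:6.2%} of profiled_step")
print(f"unaccounted: {unaccounted_ns} ns")
print(f"backward / forward: {backward_to_forward:.2f}")
```

Under that assumption the four phases cover the profiled step almost completely, and the backward pass takes roughly twice as long as the forward pass. These are host-side range durations, so how much GPU work each range actually captures depends on where the script synchronizes.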

[4/8] Executing 'osrt_sum' stats report

 Time (%)  Total Time (ns)  Num Calls     Avg (ns)         Med (ns)      Min (ns)      Max (ns)       StdDev (ns)             Name
 --------  ---------------  ---------  ---------------  ---------------  ---------  --------------  ---------------  ----------------------
     51.8   22,486,241,609         12  1,873,853,467.4  1,174,244,972.5    159,071  11,242,482,846  3,031,915,471.2  pthread_cond_wait
     38.8   16,855,550,195        179     94,165,084.9    100,119,887.0      2,615     232,676,522     26,328,365.5  poll
      4.4    1,925,983,440      4,468        431,061.6        122,855.5      1,002       7,254,614        526,839.0  read
      2.5    1,095,783,652      2,197        498,763.6        490,690.0      2,986      11,305,435        350,983.0  open64
      2.0      878,762,723      4,392        200,082.6         50,429.0      1,002      40,327,698      1,790,688.9  ioctl
      0.1       41,107,956     17,513          2,347.3          1,994.0      1,814         596,387          5,073.7  mmap64
      0.0       16,781,731         98        171,242.2        117,013.5     82,622       3,196,468        336,337.1  pthread_create
      0.0       14,936,411      5,746          2,599.4          1,423.0      1,001          56,450          3,102.5  munmap
      0.0       13,846,094        900         15,384.5          3,948.0      2,135       6,482,663        215,944.1  socket
      0.0       12,498,781        113        110,608.7          1,973.0      1,082      11,304,413      1,062,900.9  fopen
      0.0       12,419,584         15        827,972.3          6,262.0      1,172      12,330,732      3,182,146.8  open
      0.0       10,076,940          2      5,038,470.0      5,038,470.0  5,017,935       5,059,005         29,040.9  nanosleep
      0.0        8,122,423         65        124,960.4         56,450.0     39,517       4,049,118        495,490.0  sleep
      0.0        5,727,743      2,384          2,402.6          1,613.0      1,001          40,921          2,411.9  write
      0.0        5,676,611        841          6,749.8          4,439.0      1,262          49,998          5,138.8  mmap
      0.0        1,613,690          3        537,896.7        701,854.0      3,717         908,119        473,969.7  fread
      0.0        1,170,573          8        146,321.6         41,331.0      5,010         615,385        223,847.6  fopen64
      0.0        1,136,181         74         15,353.8         12,820.0      1,172          58,634          9,813.4  fgets
      0.0          516,992         11         46,999.3         30,880.0     14,348         235,521         63,615.9  sem_timedwait
      0.0          453,798        203          2,235.5          1,283.0      1,002          14,218          2,293.0  pthread_cond_signal
      0.0          116,238          2         58,119.0         58,119.0     24,198          92,040         47,971.5  pthread_cond_timedwait
      0.0           70,900          7         10,128.6         10,711.0      5,701          13,447          3,154.7  msgsnd
      0.0           21,923          1         21,923.0         21,923.0     21,923          21,923              0.0  pthread_mutex_lock
      0.0           14,539          3          4,846.3          5,461.0      1,593           7,485          2,993.7  pipe2
      0.0           11,442          7          1,634.6          1,733.0      1,122           2,125            325.5  fcntl
      0.0            5,701          2          2,850.5          2,850.5      2,234           3,467            871.9  sigaction
      0.0            5,551          2          2,775.5          2,775.5      1,533           4,018          1,757.2  fwrite
      0.0            4,979          3          1,659.7          1,873.0      1,142           1,964            450.6  fflush
      0.0            4,548          2          2,274.0          2,274.0      1,092           3,456          1,671.6  pthread_cond_broadcast
      0.0            2,365          1          2,365.0          2,365.0      2,365           2,365              0.0  openat64
      0.0            1,182          1          1,182.0          1,182.0      1,182           1,182              0.0  pthread_mutex_trylock

[5/8] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)    Max (ns)     StdDev (ns)                 Name
 --------  ---------------  ---------  -------------  -------------  ----------  -----------  -------------  ---------------------------------
     32.7    7,699,062,535     25,295      304,370.9        9,278.0       4,599  495,402,097    3,601,347.1  cudaLaunchKernel
     32.6    7,688,218,088     25,295      303,942.2        8,957.0       4,408  495,401,787    3,601,310.4  cudaLaunchKernel
      6.8    1,590,651,814        560    2,840,449.7      228,522.5       7,394   24,446,695    5,266,448.2  cudaMemcpyAsync
      6.8    1,590,368,607        560    2,839,943.9      228,267.0       7,084   24,446,044    5,266,294.5  cudaMemcpyAsync
      6.0    1,404,040,027          6  234,006,671.2  125,623,731.5  59,392,445  496,968,134  195,461,349.5  cudaDeviceSynchronize
      6.0    1,404,036,261          6  234,006,043.5  125,623,425.5  59,391,975  496,966,310  195,460,975.3  cudaDeviceSynchronize
      4.1      962,781,997        890    1,081,777.5      535,348.0     283,475   45,822,988    3,747,493.7  cudaMalloc
      4.1      961,965,203        890    1,080,859.8      534,226.0     282,143   45,822,347    3,746,358.4  cudaMalloc
      0.3       75,979,762      1,544       49,209.7        8,812.5       7,415    5,644,171      320,920.8  cuLaunchKernel
      0.3       71,308,169        260      274,262.2      110,336.0       6,302    5,737,494      686,387.2  cudaMemsetAsync
      0.3       71,241,386        260      274,005.3      110,105.0       6,012    5,737,223      686,386.7  cudaMemsetAsync
      0.0       11,032,542          4    2,758,135.5    3,403,764.5     630,926    3,594,087    1,421,895.7  cuLibraryLoadData
      0.0        6,668,908          2    3,334,454.0    3,334,454.0     369,815    6,299,093    4,192,632.7  cudaFree
      0.0        6,665,871          2    3,332,935.5    3,332,935.5     367,800    6,298,071    4,193,334.8  cudaFree
      0.0        3,951,073        292       13,531.1       14,604.0       2,084       33,546        5,919.0  cudaStreamSynchronize
      0.0        3,588,531        292       12,289.5       13,036.0       1,804       31,662        5,429.9  cudaStreamSynchronize
      0.0        1,935,776        890        2,175.0        1,433.0         481       30,890        1,908.3  cudaStreamIsCapturing_v10000
      0.0        1,530,948        772        1,983.1          752.0         601      850,335       30,583.2  cuKernelGetFunction
      0.0          941,282          2      470,641.0      470,641.0     462,766      478,516       11,136.9  cudaGetDeviceProperties_v2_v12000
      0.0          867,526         36       24,097.9          446.0         411      812,350      135,184.7  cudaEventCreateWithFlags
      0.0          199,995        784          255.1          190.0         110        1,503          154.6  cuGetProcAddress_v2
      0.0            6,353          3        2,117.7        1,763.0       1,243        3,347        1,095.9  cuInit
      0.0            5,982          5        1,196.4          992.0         792        2,204          574.9  cuLibraryGetKernel
      0.0            1,833          2          916.5          916.5         571        1,262          488.6  cudaGetDriverEntryPoint_v11030
      0.0            1,392          3          464.0          190.0         150        1,052          509.6  cuModuleGetLoadingMode

[6/8] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     27.3    3,002,482,417        900  3,336,091.6  1,453,497.5  1,432,568  7,054,938  2,156,094.1  ampere_sgemm_128x64_tn
     20.3    2,235,157,813        772  2,895,282.1  1,446,522.0  1,444,856  5,804,966  2,034,343.4  void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1>(T1::Params)
     13.5    1,487,779,391        260  5,722,228.4  5,723,684.5  5,599,298  5,744,839     15,668.3  ampere_sgemm_128x32_sliced1x4_nn
      7.8      854,356,834        768  1,112,443.8  1,423,609.0    480,350  1,438,010    441,460.3  ampere_sgemm_128x128_nn
      6.6      731,264,017        128  5,713,000.1  5,711,075.5  5,708,390  5,767,143      8,255.9  ampere_sgemm_128x64_nn
      6.6      727,675,106        128  5,684,961.8  5,683,556.0  5,681,734  5,742,566      7,505.3  ampere_sgemm_128x64_nt
      2.0      219,552,934      2,444     89,833.4     36,608.0     24,128    338,974    103,994.8  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      1.8      197,551,348      2,064     95,712.9     41,424.0      1,888    235,102     74,258.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BinaryFunctor<float, float, floa…
      1.3      143,804,577      1,304    110,279.6    110,304.0     71,264    121,024      3,919.4  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
      1.2      130,989,437        652    200,904.0    200,959.0    131,615    213,407      5,854.9  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListScalarListMe…
      1.1      122,711,755        256    479,342.8    479,230.0    478,877    481,950        445.4  ampere_sgemm_128x128_nt
      0.9      102,490,391        512    200,176.5    204,207.0    181,567    233,983      9,393.3  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.9      102,152,443      2,436     41,934.5     30,560.0     11,488    237,759     53,409.9  void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctor_add<float>, std::arr…
      0.9      102,036,288        652    156,497.4    156,575.0    102,623    165,471      4,821.4  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
      0.9      101,626,725        652    155,869.2    155,999.5    101,600    164,735      4,865.5  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
      0.8       91,768,948      3,088     29,717.9     30,655.0     11,648     47,808      8,207.7  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.8       83,350,734        256    325,588.8    323,327.0    320,446    398,014     12,688.0  ampere_sgemm_128x128_tn
      0.7       73,081,191        652    112,087.7    112,128.0     72,640    121,696      3,779.5  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListScalarListMe…
      0.7       72,100,619        652    110,583.8    110,559.5     70,656    121,343      3,893.2  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
      0.7       71,606,898        652    109,826.5    110,112.0     71,488    117,824      3,681.8  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
      0.4       47,817,595        256    186,787.5    185,823.0    184,735    216,255      5,259.4  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.4       44,066,285        256    172,133.9    171,455.5    169,471    191,391      3,494.3  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.4       42,214,467        644     65,550.4     88,591.0     25,024     99,487     32,658.3  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::func_wrapp…
      0.4       39,057,413        256    152,568.0    152,431.0    147,999    158,719      3,254.9  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<float, float, floa…
      0.3       38,212,459        256    149,267.4    149,663.5    145,407    154,303      2,952.5  void at::native::vectorized_elementwise_kernel<(int)4, at::native::neg_kernel_cuda(at::TensorIterat…
      0.2       25,162,498        990     25,416.7     14,880.0      1,664     76,160     26,167.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::FillFunctor<float>, std::array<c…
      0.2       19,800,351        128    154,690.2    154,735.5    145,503    157,567      1,301.0  void at::native::vectorized_elementwise_kernel<(int)4, at::native::exp_kernel_cuda(at::TensorIterat…
      0.2       18,556,130        128    144,969.8    144,991.5    140,799    146,047        562.8  void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_backward_kernel_cuda(at:…
      0.1       14,518,616        128    113,426.7    112,288.0    111,392    130,015      4,182.0  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MaxOps<flo…
      0.1       11,134,852        128     86,991.0     86,848.0     86,335     89,631        625.0  void at::native::vectorized_elementwise_kernel<(int)4, at::native::sigmoid_kernel_cuda(at::TensorIt…
      0.1        5,566,435        260     21,409.4     21,312.0     19,392     26,208        565.0  void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
      0.1        5,510,950        128     43,054.3     43,072.0     42,592     43,616        214.8  void at::native::<unnamed>::CatArrayBatchedCopy<at::native::<unnamed>::OpaqueType<(unsigned int)4>,…
      0.0        5,230,759        260     20,118.3     20,128.0     18,976     20,799        242.2  void at::native::reduce_kernel<(int)128, (int)4, at::native::ReduceOp<float, at::native::func_wrapp…
      0.0        5,217,579        520     10,033.8      9,376.0      1,888     20,608      7,635.3  void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<float, float, floa…
      0.0        4,816,299        260     18,524.2     18,431.5     17,311     22,560        794.4  void at::native::reduce_kernel<(int)512, (int)1, at::native::ReduceOp<float, at::native::MeanOps<fl…
      0.0        4,718,019        260     18,146.2     18,144.0     17,984     18,912         85.1  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0        4,711,592        256     18,404.7     18,000.0     17,312     20,224        919.3  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0        3,382,639        128     26,426.9     26,399.0     25,760     28,096        363.9  void at::native::<unnamed>::CatArrayBatchedCopy_aligned16_contig<at::native::<unnamed>::OpaqueType<…
      0.0        2,235,989        260      8,600.0      7,776.0      7,136     58,208      5,342.0  void at::native::index_elementwise_kernel<(int)128, (int)4, void at::native::gpu_index_kernel<void …
      0.0        1,470,523        128     11,488.5     11,424.0     10,624     12,448        334.7  void at::native::_scatter_gather_elementwise_kernel<(int)128, (int)4, void at::native::_cuda_scatte…
      0.0          718,396          4    179,599.0    179,471.0    178,847    180,607        733.7  void at::native::<unnamed>::cunn_SoftMaxBackward<(int)4, float, float, float, at::native::<unnamed>…
      0.0          676,317        260      2,601.2      2,560.0      2,336      3,072        135.2  void at::native::vectorized_elementwise_kernel<(int)4, void at::native::<unnamed>::pow_tensor_scala…
      0.0          601,178        128      4,696.7      4,640.0      4,448      5,760        278.6  void at::native::elementwise_kernel<(int)128, (int)4, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0          567,356        260      2,182.1      2,144.0      2,016      2,816        152.1  void at::native::vectorized_elementwise_kernel<(int)4, at::native::CUDAFunctorOnSelf_add<float>, st…
      0.0          532,636        260      2,048.6      2,048.0      1,760      3,008        176.7  void <unnamed>::elementwise_kernel_with_index<int, at::native::arange_cuda_out(const c10::Scalar &,…
      0.0          521,918        260      2,007.4      1,984.0      1,888      2,560        111.7  void at::native::vectorized_elementwise_kernel<(int)4, at::native::rsqrt_kernel_cuda(at::TensorIter…
      0.0          437,374          4    109,343.5    109,343.5    109,023    109,664        273.3  void at::native::<unnamed>::cunn_SoftMaxForwardSmem<(int)4, float, float, float, at::native::<unnam…
      0.0          275,134          4     68,783.5     69,039.0     67,680     69,376        758.2  void at::native::<unnamed>::nll_loss_forward_reduce_cuda_kernel_2d<float, float, long>(T1 *, T1 *, …
      0.0          165,983          4     41,495.8     40,992.0     40,480     43,519      1,425.2  void <unnamed>::indexing_backward_kernel<float, (int)4>(const long *, const long *, const T1 *, T1 …
      0.0          156,287          4     39,071.8     39,072.0     37,951     40,192        920.8  void at::native::<unnamed>::nll_loss_backward_reduce_cuda_kernel_2d<float, long>(T1 *, const T1 *, …
      0.0           42,240          4     10,560.0     10,576.0     10,432     10,656         94.2  void at_cuda_detail::cub::DeviceRadixSortSingleTileKernel<at_cuda_detail::cub::DeviceRadixSortPolic…
      0.0           15,776          4      3,944.0      3,936.0      3,744      4,160        177.9  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
      0.0           12,704          4      3,176.0      3,232.0      3,008      3,232        112.0  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
      0.0           10,400          4      2,600.0      2,592.0      2,592      2,624         16.0  void at::native::vectorized_elementwise_kernel<(int)4, at::native::BUnaryFunctor<long, long, long, …
      0.0            9,408          4      2,352.0      2,352.0      2,336      2,368         18.5  void at::native::vectorized_elementwise_kernel<(int)4, at::native::AUnaryFunctor<long, long, long, …
      0.0            3,552          1      3,552.0      3,552.0      3,552      3,552          0.0  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<unsigned int, (int)4, void …

[7/8] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)            Operation
 --------  ---------------  -----  -----------  -----------  --------  ----------  -----------  ------------------------------
     99.5    1,509,378,340    292  5,169,103.9  1,929,510.0     1,888  24,205,822  6,308,501.9  [CUDA memcpy Host-to-Device]
      0.5        7,530,656    268     28,099.5     25,503.0    24,896     120,959     14,041.7  [CUDA memcpy Device-to-Device]
      0.0          555,642    260      2,137.1      2,144.0     1,056       2,688         87.4  [CUDA memset]

[8/8] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)            Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------------------
 13,627.402    292    46.669    26.214     0.010   104.858       42.616  [CUDA memcpy Host-to-Device]
  6,189.235    268    23.094    20.972    20.972   102.400       12.254  [CUDA memcpy Device-to-Device]
      0.021    260     0.000     0.000     0.000     0.000        0.000  [CUDA memset]

Generated:
    /nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_train.nsys-rep
    /nfs/ofs-llm-ssd/user/gogongxt/Projects/cs336/assignment/assignment2-systems/nsys_results/2.7B_ctx512_train.sqlite
Saved: nsys_results/2.7B_ctx512_train.nsys-rep
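This is a very detailed NVIDIA Nsight Systems profiling report. It profiles a GPT-style model with 2.7 billion parameters at a context length of 512 in three different scenarios.
First, what each test measures: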
1. What the tests measure

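The three tests correspond to the core steps of training a deep learning model:
- Forward (forward pass): the model computes predictions from the input. Data flows from the input layer to the output layer, with no gradient computation and no weight update. This mainly measures the compute cost and latency of inference.
- Backward (forward + backward pass): the gradient computation. After the forward pass produces predictions and a loss, the backward pass propagates the loss gradient from the output layer back toward the input, computing each parameter's contribution to the loss (its gradient). This test runs a forward pass followed by a backward pass, so its step time includes both.
- Train (full training step): the combination of forward pass, backward pass and optimizer step. It represents the cost of one complete training iteration and is the most direct measure of training throughput.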
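2. Summary of results

The report shows several typical and important performance characteristics of large-model training and inference. They are summarized below in terms of overall time, compute bottlenecks and memory operations.
2.1 Overall time and relative cost

First, compare the total time of the three tests (based on profiled_step and the main phases):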
Test	Step time (profiled_step)	Forward time (forward_pass)	Backward time (backward_pass)	Optimizer time (optimizer_step)
Forward	~845 ms	~845 ms	N/A	N/A
Backward	~2,581 ms	~843 ms	~1,734 ms	N/A
Train	~2,758 ms	~844 ms	~1,732 ms	~179 ms
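Key insights:
- Backward costs about twice the forward pass: the Backward and Train tests both show backward_pass (~1,734 ms) taking about 2.05 times as long as forward_pass (~844 ms). This matches the usual rule of thumb that backward costs roughly twice the forward pass, because each forward matrix multiply needs two in backward: one for the gradient with respect to its input and one for the gradient with respect to its weights.
- The optimizer step is comparatively cheap: optimizer_step takes only about 6.5% of the full training step (179 ms / 2,758 ms). In this configuration the time is dominated by the forward and backward computation, not by the parameter update.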
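The two ratios above follow directly from the Train row of the table. A minimal sketch of that arithmetic, using the rounded millisecond values from the table (the variable names are illustrative only):

```python
# Rounded timings in ms, taken from the Train row of the table above.
profiled_step_ms = 2758
forward_ms = 844
backward_ms = 1732
optimizer_ms = 179

# How much longer backward takes than forward.
backward_to_forward = backward_ms / forward_ms

# Fraction of the whole training step spent in the optimizer.
optimizer_share = optimizer_ms / profiled_step_ms

# Time inside profiled_step that none of the three sub-ranges covers.
unattributed_ms = profiled_step_ms - (forward_ms + backward_ms + optimizer_ms)

print(f"backward / forward: {backward_to_forward:.2f}x")
print(f"optimizer share:    {optimizer_share:.1%}")
print(f"unattributed:       {unattributed_ms} ms")
```

The small unattributed remainder suggests the three sub-ranges cover essentially the whole step.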
2.2 Bottleneck analysis: where does the time go?

The NVTX summary (nvtx_sum) breaks the time down by model component (Train test shown):
Component (Range)	Time (ns)	Ratio to profiled_step	Notes
`profiled_step`	2,758,088,295	100%	One full training step
`forward_pass`	843,662,433	30.6%	Forward pass
`backward_pass`	1,732,475,648	62.8%	Backward pass
`optimizer_step`	178,760,308	6.5%	Parameter update
`attention_sublayer`	1,896,613,679	68.8%	Total over all attention_sublayer ranges
`ffn_sublayer`	701,380,211	25.4%	Total over all ffn_sublayer ranges
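Key insights:
- Attention is the hottest NVTX range: the attention_sublayer total is about 69% of the profiled_step time, against about 25% for ffn_sublayer. Two caveats apply. First, NVTX PushPop ranges record host-side wall time, and CUDA kernel launches are asynchronous, so a range's duration does not necessarily equal the GPU time of the work launched inside it. Second, in the forward-only report the same range has 128 instances (32 layers × 4 passes, warmup included) with a single ~867 ms outlier; if the Train report is laid out the same way, the total also covers warmup passes and is not a strict share of one step. Attention does scale as O(n²) in sequence length, but at ctx=512 with d_model=2560 the quadratic term is small next to the projection matmuls (their FLOP ratio is n / (2·d_model) = 0.1), so it is unlikely to explain the gap on its own.
- Per-layer times differ: the report lists every layer from layer_0 to layer_31, and layer_0 takes noticeably longer in total than the others. In the forward-only report its four instances have a median of ~17.6 ms but a maximum of ~888 ms, which points to a one-time cost on the very first call (most likely lazy CUDA and library initialization during warmup) rather than to anything specific to the first layer's computation.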
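As a rough cross-check on the attention share discussed above, the forward matmul FLOPs per layer can be estimated from the model configuration in the log. This is only a sketch under stated assumptions: it counts matmuls only (no softmax, normalization or elementwise work), it ignores any savings from causal masking, and the number of FFN weight matrices is an assumption, since the report does not state the FFN structure.

```python
# Model configuration taken from the profiler log.
d_model = 2560
d_ff = 10240
seq_len = 512

# ASSUMPTION: a gated (SwiGLU-style) FFN with three weight matrices.
# Use 2 for a plain two-matrix MLP.
ffn_matrices = 3

# Forward matmul FLOPs per sequence per layer (2 FLOPs per multiply-add).
proj_flops = 4 * 2 * seq_len * d_model * d_model    # Q, K, V and output projections
score_flops = 2 * 2 * seq_len * seq_len * d_model   # Q @ K^T and weights @ V, all heads
ffn_flops = ffn_matrices * 2 * seq_len * d_model * d_ff

attention_flops = proj_flops + score_flops

# Share of attention FLOPs that is quadratic in sequence length.
quadratic_fraction = score_flops / attention_flops

# Attention FLOPs relative to FFN FLOPs.
attention_to_ffn = attention_flops / ffn_flops

print(f"quadratic share of attention FLOPs: {quadratic_fraction:.1%}")
print(f"attention FLOPs / FFN FLOPs:        {attention_to_ffn:.2f}")
```

Under these assumptions the attention sublayer performs fewer matmul FLOPs than the FFN sublayer, so its larger NVTX time is probably not explained by FLOPs alone. Host-side overhead from launching many small kernels is one candidate; comparing GPU-side kernel times per range would settle it.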
2.3 GPU kernel analysis: what is happening underneath?

The CUDA GPU kernel summary (cuda_gpu_kern_sum) shows which low-level computations take the most time (Train test shown):
GPU kernel (Name)	Share of time	Notes
`ampere_sgemm_128x64_tn`	27.3%	Single-precision general matrix multiply (SGEMM)
`void cutlass::Kernel2<...sgemm...>`	20.3%	Matrix multiply from the CUTLASS library
`ampere_sgemm_128x32_sliced1x4_nn`	13.5%	Matrix multiply with a different tile shape
`ampere_sgemm_128x128_nn`	7.8%	Matrix multiply
`ampere_sgemm_128x64_nn`	6.6%	Matrix multiply
`...elementwise_kernel...` and `...vectorized...`	~9%	Elementwise operations (activation functions, residual additions, etc.)
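Key insights:
- Matrix multiplication (GEMM) dominates: the top entries are all sgemm (single-precision general matrix multiply) kernels. Summed over the eight sgemm kernels in the report, they account for about 84% of GPU kernel time. The kernel names indicate FP32 kernels running on the ordinary CUDA cores (the CUTLASS one is explicitly a `simt` kernel), so performance here depends on FP32 throughput and Tensor Cores are not being used.
- Elementwise work is not negligible: each elementwise kernel is short, but there are thousands of launches, and together the elementwise_kernel and vectorized_elementwise_kernel rows add up to roughly 9% of kernel time. These operations are usually limited by memory bandwidth rather than by compute.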
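The GEMM share can be recomputed from the `cuda_gpu_kern_sum` rows themselves. The sketch below parses rows copied verbatim from the Train report and sums the Time (%) column for kernels whose name contains `sgemm`. The helper names `parse_kernel_row` and `gemm_share` are made up for this example and are not part of any nsys tooling.

```python
def parse_kernel_row(line: str) -> dict:
    """Split one data row of an nsys `cuda_gpu_kern_sum` table into fields."""
    # The first eight columns are numeric. The kernel name is the remainder
    # and may contain spaces, so split at most eight times.
    parts = line.split(None, 8)
    return {
        "time_pct": float(parts[0]),
        "total_ns": int(parts[1].replace(",", "")),
        "instances": int(parts[2].replace(",", "")),
        "name": parts[8].strip(),
    }


def gemm_share(rows: list) -> float:
    """Sum the Time (%) column over rows whose kernel name contains 'sgemm'."""
    return sum(r["time_pct"] for r in rows if "sgemm" in r["name"])


# All eight sgemm rows plus one elementwise row, copied from the report.
# The elementwise kernel name is truncated in the report and left as-is.
REPORT_ROWS = """
     27.3    3,002,482,417        900  3,336,091.6  1,453,497.5  1,432,568  7,054,938  2,156,094.1  ampere_sgemm_128x64_tn
     20.3    2,235,157,813        772  2,895,282.1  1,446,522.0  1,444,856  5,804,966  2,034,343.4  void cutlass::Kernel2<cutlass_80_simt_sgemm_128x32_8x5_nt_align1>(T1::Params)
     13.5    1,487,779,391        260  5,722,228.4  5,723,684.5  5,599,298  5,744,839     15,668.3  ampere_sgemm_128x32_sliced1x4_nn
      7.8      854,356,834        768  1,112,443.8  1,423,609.0    480,350  1,438,010    441,460.3  ampere_sgemm_128x128_nn
      6.6      731,264,017        128  5,713,000.1  5,711,075.5  5,708,390  5,767,143      8,255.9  ampere_sgemm_128x64_nn
      6.6      727,675,106        128  5,684,961.8  5,683,556.0  5,681,734  5,742,566      7,505.3  ampere_sgemm_128x64_nt
      1.1      122,711,755        256    479,342.8    479,230.0    478,877    481,950        445.4  ampere_sgemm_128x128_nt
      0.8       83,350,734        256    325,588.8    323,327.0    320,446    398,014     12,688.0  ampere_sgemm_128x128_tn
      2.0      219,552,934      2,444     89,833.4     36,608.0     24,128    338,974    103,994.8  void at::native::elementwise_kernel<(int)128, (int)2, void at::native::gpu_kernel_impl_nocast<at::n…
"""

rows = [parse_kernel_row(line) for line in REPORT_ROWS.splitlines() if line.strip()]
gemm_ns = sum(r["total_ns"] for r in rows if "sgemm" in r["name"])

print(f"GEMM share of kernel time: {gemm_share(rows):.1f}%")
print(f"GEMM kernel time:          {gemm_ns / 1e9:.2f} s")
```

Because each percentage in the report is rounded to one decimal place, the summed share is accurate to within a few tenths of a percent.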
2.4 Memory operations

The cuda_gpu_mem_time_sum and cuda_gpu_mem_size_sum reports show the following:
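- Host-to-Device copies are constant: in all three tests the Host-to-Device copies total ~13.6 GB and all happen at the start of the run. This corresponds to moving the model parameters and the randomly generated input batch from CPU memory to GPU memory. The copy sizes fit that reading: the median copy, 26.214 MB, is the size of one 2560×2560 FP32 matrix, and the largest, 104.858 MB, is one 2560×10240 FP32 matrix. Once these copies are done, the computation needs very little further CPU-GPU traffic.
- Device-to-Device copies appear only when there is a backward pass: the Backward and Train tests show about 6.2 GB of Device-to-Device copies. These typically come from gradient accumulation, layout changes or temporary storage of intermediate results during backward. The median copy size, 20.972 MB, equals one FP32 tensor of shape [4, 512, 2560] (batch × context × d_model), which is consistent with activation-sized copies.
- Memory copy time is small next to kernel time: the Host-to-Device copies total about 1.5 s, against roughly 11 s of kernel execution in cuda_gpu_kern_sum, and they happen once at startup rather than inside the training step. The profiled work is therefore compute-bound rather than transfer-bound.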
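The Host-to-Device volume can be checked against the size of the FP32 weights. The sketch below rebuilds a parameter count from the configuration in the log and compares it with the reported copy total. The layer layout is an assumption (three FFN matrices per layer, two norm weight vectors per layer plus a final norm, untied input embedding and output projection); the report does not state it.

```python
# Model configuration from the profiler log.
d_model, d_ff, num_layers, vocab_size = 2560, 10240, 32, 10000
batch_size, context_length = 4, 512
BYTES_FP32 = 4

# ASSUMED layer layout (not stated in the report): four attention
# projections, three FFN matrices and two norm weight vectors per layer.
per_layer = 4 * d_model * d_model + 3 * d_model * d_ff + 2 * d_model
# ASSUMED: untied embedding and output projection, plus one final norm.
params = num_layers * per_layer + 2 * vocab_size * d_model + d_model

param_mb = params * BYTES_FP32 / 1e6

# One FP32 activation tensor of shape [batch, context, d_model].
activation_mb = batch_size * context_length * d_model * BYTES_FP32 / 1e6

# Totals reported by cuda_gpu_mem_size_sum and cuda_gpu_mem_time_sum.
h2d_mb = 13_627.402
h2d_seconds = 1_509_378_340 / 1e9
h2d_gb_per_s = h2d_mb / 1e3 / h2d_seconds

print(f"estimated parameters: {params / 1e9:.2f} B")
print(f"estimated FP32 size:  {param_mb:,.1f} MB (reported H2D: {h2d_mb:,.1f} MB)")
print(f"activation tensor:    {activation_mb:.3f} MB")
print(f"effective H2D rate:   {h2d_gb_per_s:.1f} GB/s")
```

Under these assumptions the estimated weight size is within a fraction of a megabyte of the reported Host-to-Device total, which supports reading those copies as the initial weight upload. The count comes out near 3.4B parameters, above the nominal 2.7B, which is what a three-matrix FFN with d_ff = 10240 would produce.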
3. Overall conclusions

Taken together, the analysis leads to the following conclusions:
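- Bottleneck: training this 2.7B-parameter model at context length 512 is a typical compute-bound workload. The main limit is the GPU's matrix-multiply throughput, followed by memory bandwidth.
- Attention matters: attention_sublayer is the largest NVTX range, at close to 70% of the profiled_step time by that measure, so optimizations aimed at attention (such as FlashAttention) are a natural target.
- Backward is expensive: the backward pass takes slightly more than twice as long as the forward pass and is the largest part of the training cost.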
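Optimization opportunities:
- Enable mixed-precision training (Mixed Precision / AMP): the run currently uses single precision (FP32). Switching to FP16 or BF16 so that the matrix multiplies run on Tensor Cores should speed up the GEMM kernels by a large factor, which makes this the most effective single change.
- Use FlashAttention: the report shows no sign of FlashAttention, and the plain scaled_dot_product_attention implementation is expensive. By restructuring memory access, FlashAttention can substantially reduce both the time and the memory footprint of the attention layer.
- Operator fusion: the large number of elementwise_kernel launches points to many small operations. Fusing them, either through a compiler such as torch.compile or with hand-written fused kernels (as in TransformerEngine), reduces kernel launch overhead and memory traffic.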
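To get a feel for how much a faster GEMM could help overall, Amdahl's law gives an upper-level estimate. The sketch below uses the GEMM share of GPU kernel time from cuda_gpu_kern_sum (the sgemm rows sum to about 84%). The per-kernel speedup factors are hypothetical placeholders, not measurements; real gains depend on the hardware and on which kernels get selected.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the time
    becomes `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Share of GPU kernel time spent in sgemm kernels (from cuda_gpu_kern_sum).
gemm_fraction = 0.84

# HYPOTHETICAL GEMM speedups, for illustration only.
for gemm_speedup in (2.0, 4.0, 8.0):
    total = overall_speedup(gemm_fraction, gemm_speedup)
    print(f"GEMM {gemm_speedup:.0f}x faster -> kernel time {total:.2f}x faster overall")

# Even an infinitely fast GEMM cannot beat this bound.
upper_bound = 1.0 / (1.0 - gemm_fraction)
print(f"upper bound: {upper_bound:.2f}x")
```

The bound shows why fusion of the remaining elementwise and optimizer kernels starts to matter once the matrix multiplies have been accelerated.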