<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Computer Science Research and Development]]></title><description><![CDATA[Subscribe to receive occasional updates on computer science research and development articles, and other technical publications.]]></description><link>https://johnmusgrave.com</link><image><url>https://johnmusgrave.com/img/substack.png</url><title>Computer Science Research and Development</title><link>https://johnmusgrave.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 27 Jun 2026 01:31:13 GMT</lastBuildDate><atom:link href="https://johnmusgrave.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[John Musgrave]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[musgravejw@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[musgravejw@substack.com]]></itunes:email><itunes:name><![CDATA[John Musgrave]]></itunes:name></itunes:owner><itunes:author><![CDATA[John Musgrave]]></itunes:author><googleplay:owner><![CDATA[musgravejw@substack.com]]></googleplay:owner><googleplay:email><![CDATA[musgravejw@substack.com]]></googleplay:email><googleplay:author><![CDATA[John Musgrave]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[vLLM throughput-latency sweep on NVIDIA A100]]></title><description><![CDATA[Benchmarking Llama-3.1-8B-Instruct from one concurrent request to 64, and reading the result against the A100&#8217;s memory roofline.]]></description><link>https://johnmusgrave.com/p/vllm-throughput-latency-sweep-on</link><guid isPermaLink="false">https://johnmusgrave.com/p/vllm-throughput-latency-sweep-on</guid><dc:creator><![CDATA[John 
Musgrave]]></dc:creator><pubDate>Mon, 25 May 2026 18:19:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5M1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On a single NVIDIA A100 40GB, decode at one request at a time runs at about 75 tokens/sec, roughly 77% of the GPU&#8217;s memory-bandwidth ceiling. Batching buys throughput almost for free up to a concurrency of 8, a 6.6x gain with almost no latency cost. Past that the curve bends: throughput keeps rising but per-token latency climbs and tail latency goes vertical. The knee sits at C=8, and that point, not peak throughput, is where you run a production system.</p><h2>Setup</h2><p>Everything ran on a single GPU on one node, with no tensor parallelism and default vLLM scheduling. The Lambda instance also has 30 vCPUs, 1800 GB RAM, and a 6TB SSD.</p><ul><li><p>GPU: NVIDIA A100 40GB (on Lambda)</p></li><li><p>Model: <code>meta-llama/Llama-3.1-8B-Instruct</code>, BF16</p></li><li><p>vLLM 0.23.0</p><p></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;1caf1426-379c-4d7e-8fa9-dacad7a98a17&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">vllm serve meta-llama/Llama-3.1-8B-Instruct</code></pre></div></li></ul><p></p><h2>Method</h2><p>The sweep varies the number of concurrent requests from 1 to 64, doubling at each step, and measures how that load affects latency and throughput. Every request used a fixed 1024-token input and a fixed 128-token output, with <code>--ignore-eos</code> so the model always generates exactly 128 tokens. 
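</p><p>With the lengths fixed, the single-request ceiling quoted at the top is easy to sanity-check. The following is a back-of-envelope sketch, not part of the benchmark: it assumes the A100 40GB&#8217;s published peak memory bandwidth of 1555 GB/s and a nominal 8.03 billion parameters for Llama-3.1-8B, and it counts only weight traffic, ignoring KV-cache and activation reads. The function name is illustrative.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Back-of-envelope decode roofline for a single request (batch size 1).
# Assumptions (not taken from the benchmark output): the published A100 40GB
# peak memory bandwidth and the nominal Llama-3.1-8B parameter count.
PEAK_BANDWIDTH_GBPS = 1555.0  # GB/s, A100 40GB spec sheet value
PARAMS_BILLIONS = 8.03        # nominal Llama-3.1-8B parameter count
BYTES_PER_PARAM = 2           # BF16


def decode_ceiling_tok_s(params_billions, bytes_per_param, bandwidth_gbps):
    """Upper bound on tokens/sec if every decode step streams all weights once."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gbps / weight_gb


ceiling = decode_ceiling_tok_s(PARAMS_BILLIONS, BYTES_PER_PARAM, PEAK_BANDWIDTH_GBPS)
measured = 75.0  # tokens/sec at C=1, as reported in the text

print(f"weights-only ceiling: {ceiling:.1f} tok/s")
print(f"measured / ceiling:   {measured / ceiling:.0%}")</code></pre></div><p>Under these assumptions the ratio lands close to the 77% figure above. Counting KV-cache reads would lower the ceiling slightly and raise the ratio.</p><p>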
Fixed lengths keep the curve clean; variable lengths smear every point because each concurrency level ends up averaging over a different length distribution.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;3ef179f8-8f6b-438c-9f1d-cb1969c5d113&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">for C in 1 2 4 8 16 32 64; do
  vllm bench serve --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 1024 --random-output-len 128 \
    --ignore-eos --max-concurrency $C --num-prompts $((C*20)) \
    --request-rate inf --metadata vllm=0.23.0 gpu=a100-40gb \
    --save-result --result-filename sweep_c${C}.json
done</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5M1l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5M1l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 424w, https://substackcdn.com/image/fetch/$s_!5M1l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 848w, https://substackcdn.com/image/fetch/$s_!5M1l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 1272w, https://substackcdn.com/image/fetch/$s_!5M1l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5M1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png" width="1050" height="750" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1050,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52312,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://johnmusgrave.com/i/203591904?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5M1l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 424w, https://substackcdn.com/image/fetch/$s_!5M1l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 848w, https://substackcdn.com/image/fetch/$s_!5M1l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 1272w, https://substackcdn.com/image/fetch/$s_!5M1l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4802e8d4-fc4c-40ca-820e-5f3e947bc447_1050x750.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vd-B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vd-B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 424w, 
https://substackcdn.com/image/fetch/$s_!Vd-B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 848w, https://substackcdn.com/image/fetch/$s_!Vd-B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 1272w, https://substackcdn.com/image/fetch/$s_!Vd-B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vd-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png" width="1050" height="750" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1050,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://johnmusgrave.com/i/203591904?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Vd-B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 
424w, https://substackcdn.com/image/fetch/$s_!Vd-B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 848w, https://substackcdn.com/image/fetch/$s_!Vd-B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 1272w, https://substackcdn.com/image/fetch/$s_!Vd-B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb45f9279-0935-4a86-a4ed-245a59c6c52b_1050x750.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9gTG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9gTG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 424w, https://substackcdn.com/image/fetch/$s_!9gTG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 848w, https://substackcdn.com/image/fetch/$s_!9gTG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 1272w, https://substackcdn.com/image/fetch/$s_!9gTG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9gTG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png" width="1050" height="750" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:750,&quot;width&quot;:1050,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75500,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://johnmusgrave.com/i/203591904?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9gTG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 424w, https://substackcdn.com/image/fetch/$s_!9gTG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 848w, https://substackcdn.com/image/fetch/$s_!9gTG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 1272w, https://substackcdn.com/image/fetch/$s_!9gTG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b357b79-646f-4165-b4fb-d3710deaf074_1050x750.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>]]></content:encoded></item><item><title><![CDATA[Pre-training an LLM base model (for under $20)]]></title><description><![CDATA[I was inspired by Andrej Karpathy&#8217;s Nanochat to train an LLM base model from scratch.]]></description><link>https://johnmusgrave.com/p/recreating-chatgpt-and-llm-training</link><guid isPermaLink="false">https://johnmusgrave.com/p/recreating-chatgpt-and-llm-training</guid><dc:creator><![CDATA[John Musgrave]]></dc:creator><pubDate>Mon, 16 Mar 2026 20:59:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!W8a0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I was inspired by Andrej Karpathy&#8217;s Nanochat to train an LLM base 
model from scratch. I tried to keep the budget under $20 using an NVIDIA A100 on Lambda. I will make the weights from pre-training public.</p><h2>Base Model Training</h2><p>The Transformer architecture has 12 layers, which is enough to run in about an hour on a single GPU instance.</p><p>To count the parameters, we first need the embedding dimension, which is the number of layers times the aspect ratio (GPT-2 and Karpathy use 64): d = 12 * 64 = 768. Each layer has 4 attention matrices of 768x768, one each for Q, K, and V plus the output projection, so 4 * d^2 = 2,359,296. The feed-forward network is 4x the model dimension: a 768x3072 fully connected layer and a 3072x768 projection layer, for another 4,718,592. So each layer has about 7 million parameters. We also need embeddings for the input and output, each the 50,257-token vocab times the dimension, or about 38.6 million apiece. The total is 162 million parameters.</p><p>Following the Chinchilla scaling law, we need ~20 tokens per parameter for the training set, which is 3.2 billion tokens. With a batch size of 0.5M tokens, this works out to ~6k training iterations.</p><h2>Data: FineWeb-EDU</h2><p>I used FineWeb-EDU, the same dataset as Nanochat, because its token quality is higher than FineWeb&#8217;s, which matters for smaller datasets.</p><p>My data step streams FineWeb-EDU from Hugging Face, tokenizes it with the gpt2 BPE tokenizer, and writes flat <code>uint16</code> token files that the trainer memory-maps. 
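</p><p>The write-then-memory-map half of that step can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: <code>write_tokens</code>, <code>sample_batch</code>, and the file name <code>shard_000.bin</code> are hypothetical, and a synthetic id sequence stands in for the streamed, tokenized text so the example runs without a download. The gpt2 vocab fits in <code>uint16</code>, which is what makes the two-byte format safe.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import os
import tempfile

import numpy as np


def write_tokens(path, token_ids):
    """Write token ids to disk as a flat uint16 array (hypothetical helper)."""
    arr = np.asarray(token_ids, dtype=np.int64)
    out = arr.astype(np.uint16)
    # uint16 holds ids up to 65,535; fail loudly if any id wrapped around.
    if not np.array_equal(arr, out):
        raise ValueError("token id does not fit in uint16")
    out.tofile(path)


def sample_batch(path, batch_size, seq_len, rng):
    """Memory-map the token file and slice random (input, target) windows."""
    data = np.memmap(path, dtype=np.uint16, mode="r")
    starts = rng.integers(0, len(data) - seq_len, size=batch_size)
    x = np.stack([data[s : s + seq_len] for s in starts]).astype(np.int64)
    # Targets are the same windows shifted right by one token.
    y = np.stack([data[s + 1 : s + 1 + seq_len] for s in starts]).astype(np.int64)
    return x, y


# Stand-in ids; the real pipeline would get these from the gpt2 BPE tokenizer.
fake_tokens = np.arange(10_000) % 50_257
path = os.path.join(tempfile.mkdtemp(), "shard_000.bin")  # hypothetical name
write_tokens(path, fake_tokens)

x, y = sample_batch(path, batch_size=4, seq_len=8, rng=np.random.default_rng(0))
print(x.shape, y.shape)</code></pre></div><p>Memory-mapping means the trainer never loads the whole file: each batch touches only the pages its windows cover.</p><p>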
We can always come back later and train a custom tokenizer for more efficiency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W8a0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W8a0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 424w, https://substackcdn.com/image/fetch/$s_!W8a0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 848w, https://substackcdn.com/image/fetch/$s_!W8a0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 1272w, https://substackcdn.com/image/fetch/$s_!W8a0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W8a0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png" width="1050" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1050,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67660,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://johnmusgrave.com/i/203609279?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W8a0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 424w, https://substackcdn.com/image/fetch/$s_!W8a0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 848w, https://substackcdn.com/image/fetch/$s_!W8a0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 1272w, https://substackcdn.com/image/fetch/$s_!W8a0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedebf87b-c633-4c2c-ab97-55a485c141f5_1050x645.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><h2>Architecture</h2><p><strong>Rotary position embeddings.</strong> RoPE encodes position by rotating queries and keys instead of adding a learned position vector. Two consequences follow downstream. First, there is no positional table to store or to run off the end of, and you can stretch context past the training length by changing the rotation base. Second, because position lives inside Q and K, the KV cache stores keys that are already rotated, so cached entries stay correct as you decode. RoPE and the KV cache fit together by design.</p><p><strong>QK-norm.</strong> RMS-normalizing the queries and keys before attention bounds the size of the attention logits. At this scale it is mostly a training-stability win. 
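</p><p>To see the bound concretely, here is a small NumPy sketch. It is an illustration under stated assumptions rather than the model&#8217;s code: it uses RMS norm without a learned gain, a single head with head dimension 64, and deliberately exaggerated activations; <code>rms_norm</code> and <code>attention_logits</code> are illustrative names.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np


def rms_norm(x, eps=1e-6):
    """Scale each vector to unit RMS (no learned gain in this sketch)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def attention_logits(q, k, qk_norm):
    """Scaled dot-product scores for one head; q and k are (seq, head_dim)."""
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return q @ k.T / np.sqrt(q.shape[-1])


rng = np.random.default_rng(0)
head_dim = 64
# Exaggerated activations (scale 30) stand in for a run that is drifting.
q = rng.normal(scale=30.0, size=(16, head_dim))
k = rng.normal(scale=30.0, size=(16, head_dim))

raw = np.abs(attention_logits(q, k, qk_norm=False)).max()
normed = np.abs(attention_logits(q, k, qk_norm=True)).max()

# After RMS norm every vector has length at most sqrt(head_dim), so by
# Cauchy-Schwarz the scaled logit cannot exceed sqrt(head_dim), which is 8 here.
print(f"max |logit| without QK-norm: {raw:.1f}")
print(f"max |logit| with QK-norm:    {normed:.2f}")</code></pre></div><p>The normalized logits stay under a ceiling that depends only on the head dimension, no matter how large the raw activations grow.</p><p>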
The inference angle is quieter: bounded logits make low-precision attention less likely to overflow, which is exactly what you want when you are trying to run the score computation in fp8 or int8 to make decode cheaper.</p><p><strong>Grouped-query attention.</strong> This one is pure inference economics, which is why I left it in as an option even though the small models do not need it. During decode the KV cache is what eats memory and bandwidth, and its size scales with the number of key/value heads. Dropping from one KV head per query head down to one KV head shared by a small group of query heads shrinks the cache by the group size with little quality loss. Decode is memory-bandwidth bound, so a smaller cache is close to a direct speedup at long context and large batch. Anyone who has read a decode roofline reaches for GQA.</p><p><strong>Untied embeddings.</strong> Nanochat does not share the input and output embedding matrices, so there are two <code>vocab x dim</code> matrices instead of one. At a 50k-plus vocab that is real parameters and real weight-load bandwidth, and the output matrix is a large matmul in prefill and again every step of decode. Worth keeping in your head when you account for where the FLOPs and the bytes actually go.</p><p><strong>Logit soft-cap.</strong> A tanh squashes the logits into a fixed range before the loss. Cheap, and it keeps the final big matmul&#8217;s outputs well-behaved for the same low-precision reasons as QK-norm.</p><p><strong>bf16 activations, fp32 master weights.</strong> The model keeps fp32 weights for the optimizer but casts to bf16 inside the matmuls. 
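</p><p>A small NumPy sketch shows why the optimizer copy stays in fp32. NumPy has no bf16 type, so <code>to_bf16</code> below is a hypothetical helper that emulates it by truncating float32 values to their top 16 bits; real hardware rounds to nearest, but the outcome of this particular example is the same either way. The update size and step count are made up for illustration.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np


def to_bf16(x):
    """Emulate bf16 by zeroing the low 16 bits of each float32 value.

    Truncation is a simplification (hardware rounds to nearest), but it keeps
    bf16's layout: the float32 exponent with only 7 explicit mantissa bits.
    """
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return ((bits // 65536) * 65536).astype(np.uint32).view(np.float32)


update = np.float32(1e-3)  # made-up per-step update (learning rate times gradient)
master_fp32 = np.ones(4, dtype=np.float32)
weights_bf16 = to_bf16(np.ones(4, dtype=np.float32))

for _ in range(100):
    master_fp32 = master_fp32 + update              # accumulate in fp32
    weights_bf16 = to_bf16(weights_bf16 + update)   # accumulate in bf16 only

# The forward pass would cast the fp32 master copy down just for the matmul.
forward_weights = to_bf16(master_fp32)

print("fp32 master after 100 steps:", master_fp32[0])
print("bf16-only after 100 steps:  ", weights_bf16[0])
print("bf16 cast of the master:    ", forward_weights[0])</code></pre></div><p>Near 1.0 the gap between adjacent bf16 values is about 0.008, so an update of 0.001 is lost every step when the weights themselves live in bf16, while the fp32 copy accumulates all of them.</p><p>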
Standard training setup, and also the cleanest version of the precision boundary you redraw later for inference, where you push activations and often weights down to fp8 or int8 and keep high precision only where the dynamic range demands it.</p><p></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;c3f802a7-7276-4e20-b126-823e6eb697f0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">step   483/6184 | loss 3.5791 | bpb 1.1610 | lr 1.000 | 3508ms | 149,443 tok/s | MFU 29.8%
step   484/6184 | loss 3.5814 | bpb 1.0653 | lr 1.000 | 3509ms | 149,415 tok/s | MFU 29.8%
step   485/6184 | loss 3.5779 | bpb 1.0952 | lr 1.000 | 3512ms | 149,294 tok/s | MFU 29.8%
step   486/6184 | loss 3.5853 | bpb 1.0957 | lr 1.000 | 3519ms | 148,976 tok/s | MFU 29.7%
step   487/6184 | loss 3.5867 | bpb 1.0846 | lr 1.000 | 3510ms | 149,372 tok/s | MFU 29.8%
step   488/6184 | loss 3.5889 | bpb 1.0861 | lr 1.000 | 3509ms | 149,399 tok/s | MFU 29.8%
step   489/6184 | loss 3.5640 | bpb 1.1281 | lr 1.000 | 3509ms | 149,418 tok/s | MFU 29.8%
step   490/6184 | loss 3.5748 | bpb 1.1070 | lr 1.000 | 3522ms | 148,868 tok/s | MFU 29.7%
step   491/6184 | loss 3.5761 | bpb 1.1342 | lr 1.000 | 3512ms | 149,272 tok/s | MFU 29.8%
step   492/6184 | loss 3.5868 | bpb 1.0970 | lr 1.000 | 3509ms | 149,434 tok/s | MFU 29.8%
step   493/6184 | loss 3.5811 | bpb 1.0954 | lr 1.000 | 3508ms | 149,448 tok/s | MFU 29.8%
step   494/6184 | loss 3.6171 | bpb 1.1032 | lr 1.000 | 3510ms | 149,390 tok/s | MFU 29.8%
step   495/6184 | loss 3.5830 | bpb 1.1217 | lr 1.000 | 3505ms | 149,586 tok/s | MFU 29.9%
step   496/6184 | loss 3.6026 | bpb 1.1242 | lr 1.000 | 3510ms | 149,386 tok/s | MFU 29.8%
step   497/6184 | loss 3.5811 | bpb 1.0831 | lr 1.000 | 3508ms | 149,448 tok/s | MFU 29.8%
step   498/6184 | loss 3.5950 | bpb 1.1360 | lr 1.000 | 3507ms | 149,478 tok/s | MFU 29.8%
step   499/6184 | loss 3.5751 | bpb 1.1000 | lr 1.000 | 3508ms | 149,466 tok/s | MFU 29.8%
step   500/6184 | loss 3.5702 | bpb 1.1216 | lr 1.000 | 3505ms | 149,562 tok/s | MFU 29.9%
  &gt;&gt; sample: &lt;|endoftext|&gt;- The study of the Great Awakening in the Great Awakening - The study of the Great Awakening in the Great Awakening An overview of the study of the Great Awakening - The study of the Great Awakening in the Great Awakening - The study of the Great Awakening in the Great Awakening - The study of the Great</code></pre></div><p></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;c42e939c-005c-48b7-87f3-31fcaf2e8669&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">step   986/6184 | loss 3.3652 | bpb 1.0247 | lr 1.000 | 3503ms | 149,684 tok/s | MFU 29.9%
step   987/6184 | loss 3.3869 | bpb 1.0688 | lr 1.000 | 3505ms | 149,593 tok/s | MFU 29.9%
step   988/6184 | loss 3.3612 | bpb 1.0315 | lr 1.000 | 3503ms | 149,678 tok/s | MFU 29.9%
step   989/6184 | loss 3.3799 | bpb 1.0608 | lr 1.000 | 3502ms | 149,719 tok/s | MFU 29.9%
step   990/6184 | loss 3.3701 | bpb 1.0465 | lr 1.000 | 3502ms | 149,718 tok/s | MFU 29.9%
step   991/6184 | loss 3.3629 | bpb 1.1073 | lr 1.000 | 3505ms | 149,578 tok/s | MFU 29.9%
step   992/6184 | loss 3.3885 | bpb 1.0934 | lr 1.000 | 3499ms | 149,838 tok/s | MFU 29.9%
step   993/6184 | loss 3.3677 | bpb 1.0508 | lr 1.000 | 3493ms | 150,093 tok/s | MFU 30.0%
step   994/6184 | loss 3.3899 | bpb 1.0503 | lr 1.000 | 3490ms | 150,239 tok/s | MFU 30.0%
step   995/6184 | loss 3.3832 | bpb 1.0307 | lr 1.000 | 3491ms | 150,177 tok/s | MFU 30.0%
step   996/6184 | loss 3.3642 | bpb 1.0535 | lr 1.000 | 3488ms | 150,324 tok/s | MFU 30.0%
step   997/6184 | loss 3.3538 | bpb 1.0287 | lr 1.000 | 3489ms | 150,275 tok/s | MFU 30.0%
step   998/6184 | loss 3.3978 | bpb 1.0700 | lr 1.000 | 3490ms | 150,226 tok/s | MFU 30.0%
step   999/6184 | loss 3.3751 | bpb 1.0584 | lr 1.000 | 3488ms | 150,327 tok/s | MFU 30.0%
step  1000/6184 | loss 3.3593 | bpb 0.9976 | lr 1.000 | 3491ms | 150,202 tok/s | MFU 30.0%
  &gt;&gt; val bpb 1.0551
  &gt;&gt; sample: &lt;|endoftext|&gt;The study of the natural environment's response to climate change is the first attempt at exploring the potential of environmental change as a response to the global climate change. The study of the natural environment's response to climate change is one of the most important questions to be answered by Earth science practitioners. The study of the natural environment
  &gt;&gt; saved checkpoint at step 1000 -&gt; checkpoints/d12.pt
</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;7dab9c74-1b76-491e-93ac-11ddd95a4472&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">step  1487/6184 | loss 3.2938 | bpb 0.9987 | lr 1.000 | 3503ms | 149,668 tok/s | MFU 29.9%
step  1488/6184 | loss 3.2698 | bpb 1.0464 | lr 1.000 | 3502ms | 149,704 tok/s | MFU 29.9%
step  1489/6184 | loss 3.2671 | bpb 1.0203 | lr 1.000 | 3502ms | 149,709 tok/s | MFU 29.9%
step  1490/6184 | loss 3.2770 | bpb 1.0582 | lr 1.000 | 3503ms | 149,652 tok/s | MFU 29.9%
step  1491/6184 | loss 3.2754 | bpb 0.9966 | lr 1.000 | 3503ms | 149,668 tok/s | MFU 29.9%
step  1492/6184 | loss 3.2668 | bpb 1.0516 | lr 1.000 | 3504ms | 149,614 tok/s | MFU 29.9%
step  1493/6184 | loss 3.2773 | bpb 1.0605 | lr 1.000 | 3502ms | 149,716 tok/s | MFU 29.9%
step  1494/6184 | loss 3.2950 | bpb 1.0392 | lr 1.000 | 3502ms | 149,691 tok/s | MFU 29.9%
step  1495/6184 | loss 3.2619 | bpb 0.9845 | lr 1.000 | 3502ms | 149,691 tok/s | MFU 29.9%
step  1496/6184 | loss 3.2622 | bpb 1.0229 | lr 1.000 | 3503ms | 149,686 tok/s | MFU 29.9%
step  1497/6184 | loss 3.2663 | bpb 1.0718 | lr 1.000 | 3501ms | 149,757 tok/s | MFU 29.9%
step  1498/6184 | loss 3.2972 | bpb 1.0009 | lr 1.000 | 3500ms | 149,801 tok/s | MFU 29.9%
step  1499/6184 | loss 3.2808 | bpb 1.0536 | lr 1.000 | 3500ms | 149,792 tok/s | MFU 29.9%
step  1500/6184 | loss 3.2850 | bpb 0.9899 | lr 1.000 | 3500ms | 149,812 tok/s | MFU 29.9%
  &gt;&gt; sample: &lt;|endoftext|&gt;Videos &amp; videos Why this problem might be a problem with some parts of your body. - Injuries to the joints - Pregnancy and birth - A change in lifestyle - Physical injury How does my body develop? Your brain is composed of a layer of white matter. These white matter</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;4bbc666b-cde3-48b8-bdee-5f94d9b274bb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">step  1986/6184 | loss 3.2342 | bpb 1.0269 | lr 1.000 | 3502ms | 149,707 tok/s | MFU 29.9%
step  1987/6184 | loss 3.1972 | bpb 0.9660 | lr 1.000 | 3504ms | 149,647 tok/s | MFU 29.9%
step  1988/6184 | loss 3.2262 | bpb 1.0617 | lr 1.000 | 3505ms | 149,586 tok/s | MFU 29.9%
step  1989/6184 | loss 3.2300 | bpb 0.9829 | lr 1.000 | 3505ms | 149,596 tok/s | MFU 29.9%
step  1990/6184 | loss 3.2074 | bpb 1.0365 | lr 1.000 | 3504ms | 149,620 tok/s | MFU 29.9%
step  1991/6184 | loss 3.2127 | bpb 1.0154 | lr 1.000 | 3500ms | 149,803 tok/s | MFU 29.9%
step  1992/6184 | loss 3.2274 | bpb 0.9487 | lr 1.000 | 3499ms | 149,829 tok/s | MFU 29.9%
step  1993/6184 | loss 3.2223 | bpb 0.9896 | lr 1.000 | 3502ms | 149,720 tok/s | MFU 29.9%
step  1994/6184 | loss 3.2132 | bpb 0.9446 | lr 1.000 | 3503ms | 149,667 tok/s | MFU 29.9%
step  1995/6184 | loss 3.2187 | bpb 0.9899 | lr 1.000 | 3525ms | 148,738 tok/s | MFU 29.7%
step  1996/6184 | loss 3.2127 | bpb 0.9932 | lr 1.000 | 3549ms | 147,743 tok/s | MFU 29.5%
step  1997/6184 | loss 3.1791 | bpb 0.9453 | lr 1.000 | 3563ms | 147,157 tok/s | MFU 29.4%
step  1998/6184 | loss 3.2242 | bpb 1.0276 | lr 1.000 | 3563ms | 147,149 tok/s | MFU 29.4%
step  1999/6184 | loss 3.2165 | bpb 0.9426 | lr 1.000 | 3564ms | 147,106 tok/s | MFU 29.4%
step  2000/6184 | loss 3.2317 | bpb 0.9720 | lr 1.000 | 3552ms | 147,588 tok/s | MFU 29.5%
  &gt;&gt; val bpb 1.0177
  &gt;&gt; sample: &lt;|endoftext|&gt;The history of the - military may be divided into three periods: war, war-oriented and war of attrition. The - military is a new type of force with the purpose of strengthening -&#8217;s military presence in the West and spreading the message of peace and security, and of bringing the - military to a new
  &gt;&gt; saved checkpoint at step 2000 -&gt; checkpoints/d12.pt
</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;ef879e87-9479-4b46-80f4-04a12607f93d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">step  2487/6184 | loss 3.1387 | bpb 0.9388 | lr 1.000 | 3520ms | 148,946 tok/s | MFU 29.7%
step  2488/6184 | loss 3.1586 | bpb 1.0009 | lr 1.000 | 3505ms | 149,593 tok/s | MFU 29.9%
step  2489/6184 | loss 3.1931 | bpb 1.0120 | lr 1.000 | 3501ms | 149,759 tok/s | MFU 29.9%
step  2490/6184 | loss 3.2069 | bpb 0.9483 | lr 1.000 | 3504ms | 149,634 tok/s | MFU 29.9%
step  2491/6184 | loss 3.2065 | bpb 0.9902 | lr 1.000 | 3506ms | 149,554 tok/s | MFU 29.9%
step  2492/6184 | loss 3.1798 | bpb 0.9296 | lr 1.000 | 3511ms | 149,329 tok/s | MFU 29.8%
step  2493/6184 | loss 3.1730 | bpb 0.9887 | lr 1.000 | 3510ms | 149,372 tok/s | MFU 29.8%
step  2494/6184 | loss 3.1976 | bpb 1.0165 | lr 1.000 | 3503ms | 149,650 tok/s | MFU 29.9%
step  2495/6184 | loss 3.1854 | bpb 0.9901 | lr 1.000 | 3513ms | 149,261 tok/s | MFU 29.8%
step  2496/6184 | loss 3.1764 | bpb 0.9885 | lr 1.000 | 3503ms | 149,656 tok/s | MFU 29.9%
step  2497/6184 | loss 3.1806 | bpb 0.9894 | lr 1.000 | 3489ms | 150,266 tok/s | MFU 30.0%
step  2498/6184 | loss 3.1626 | bpb 1.0217 | lr 1.000 | 3481ms | 150,598 tok/s | MFU 30.1%
step  2499/6184 | loss 3.1928 | bpb 0.9699 | lr 1.000 | 3482ms | 150,555 tok/s | MFU 30.1%
step  2500/6184 | loss 3.1646 | bpb 0.9211 | lr 1.000 | 3478ms | 150,726 tok/s | MFU 30.1%
  &gt;&gt; sample: &lt;|endoftext|&gt;|This page is part of:| |This guide is also available in: English| |This article needs additional citations for verification. (February 2013)| |The name of a person or company of the United States; a branch of foreign business.| |This section needs additional citations for verification.
  &gt;&gt; saved checkpoint at step 2500 -&gt; checkpoints/d12.pt</code></pre></div>]]></content:encoded></item><item><title><![CDATA[Implementing a Transformer From Scratch]]></title><description><![CDATA[Building a Decoder-Only Transformer From Scratch (and What Changed Since 2017)]]></description><link>https://johnmusgrave.com/p/implementing-a-transformer-from-scratch</link><guid isPermaLink="false">https://johnmusgrave.com/p/implementing-a-transformer-from-scratch</guid><dc:creator><![CDATA[John Musgrave]]></dc:creator><pubDate>Mon, 05 Jan 2026 22:04:00 GMT</pubDate><content:encoded><![CDATA[<h1>Building a Decoder-Only Transformer From Scratch (and What Changed Since 2017)</h1><p>Almost every transformer people meet today is a decoder-only stack: GPT and everything downstream of it. The paper that started all of it, &#8220;Attention Is All You Need,&#8221; describes something larger: an encoder-decoder model built for machine translation. The decoder-only architecture is what you get when you take that model, throw away the encoder and the cross-attention, and causally mask the self-attention that remains. I built it from scratch in PyTorch, no <code>nn.Transformer</code>, as exactly that reduction, because the fastest way to understand what is essential for autoregressive generation is to remove everything that is not.</p><p>The short version up front: the spine of the 2017 model survives intact. Attention, the residual stream, and one specific scaling factor are still the core. What you drop is the entire reading half of the network, and what you change since is repair work around the edges for training stability and serving cost. Here is the build, and the deltas.</p><h2>The shape</h2><p>The full Transformer is two stacks. An encoder reads the source sequence and produces context vectors, and a decoder generates the target one token at a time while attending both to its own past output and to the encoder. 
A decoder-only model keeps only the second stack, and only part of it.</p><p>Concretely, a full decoder block has three sublayers: masked self-attention, cross-attention to the encoder, and a feed-forward block. Remove the encoder and there is nothing for cross-attention to attend to, so it goes too. The block collapses to two sublayers:</p><pre><code><code>class DecoderLayer(nn.Module):
    def forward(self, x, mask):
        x = self.sub[0](x, lambda x: self.self_attn(x, x, x, mask))  # causal self-attention
        return self.sub[1](x, self.ff)                               # feed-forward
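</code></code></pre><p>The excerpt above omits the constructor and the residual wrapper. A self-contained sketch of the same two-sublayer block follows so it can be run end to end. The names <code>SublayerConnection</code> and <code>SingleHeadAttention</code>, the constructor arguments, and the single-head toy attention are assumptions made for illustration, not the post&#8217;s actual implementation. The wrapper uses the post-norm form <code>LayerNorm(x + sublayer(x))</code>.</p><pre><code><code>import math

import torch
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Post-norm residual wrapper: LayerNorm(x + sublayer(x)). Assumed helper."""

    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        return self.norm(x + sublayer(x))


class SingleHeadAttention(nn.Module):
    """Toy one-head attention so the sketch runs; a stand-in for multi-head."""

    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask):
        q, k, v = self.q(query), self.k(key), self.v(value)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        scores = scores.masked_fill(mask == 0, -1e9)
        return scores.softmax(dim=-1) @ v


class DecoderLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.self_attn = SingleHeadAttention(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.sub = nn.ModuleList(
            [SublayerConnection(d_model), SublayerConnection(d_model)]
        )

    def forward(self, x, mask):
        x = self.sub[0](x, lambda x: self.self_attn(x, x, x, mask))  # causal self-attention
        return self.sub[1](x, self.ff)                               # feed-forward


torch.manual_seed(0)
layer = DecoderLayer(d_model=16, d_ff=32).eval()
x = torch.randn(2, 5, 16)
mask = torch.tril(torch.ones(1, 5, 5)).bool()        # (1, seq, seq), lower-triangular
out = layer(x, mask)

# Causality check: changing the last token must not change earlier positions.
x_changed = x.clone()
x_changed[:, -1] += 1.0
out_changed = layer(x_changed, mask)
print(out.shape, torch.allclose(out[:, :-1], out_changed[:, :-1], atol=1e-6))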
</code></code></pre><p>That is the whole structural change. A decoder-only block is just a transformer block whose self-attention is not allowed to look forward. Everything else in this post is either a piece carried over unchanged from the paper or a detail that is easy to get subtly wrong.</p><h2>Scaled dot-product attention, and the square root</h2><p>The center of the model is four lines, identical to the original:</p><pre><code><code>scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)
attn = scores.softmax(dim=-1)
return attn @ value, attn
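</code></code></pre><p>The effect of the <code>math.sqrt(d_k)</code> divisor can be checked numerically. The sketch below is an added illustration under simple assumptions (independent standard-normal components; the helper name <code>score_variances</code> is hypothetical). It samples random query and key vectors and compares the variance of the raw and the scaled dot products.</p><pre><code><code>import math
import random
import statistics


def score_variances(d_k, n_samples=2000, seed=0):
    """Return (raw, scaled) variance of q.k for random unit-variance q and k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [s / math.sqrt(d_k) for s in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)


for d_k in (16, 64, 256):
    raw_var, scaled_var = score_variances(d_k)
    # Raw variance grows roughly like d_k; the scaled variance stays near 1.
    print(d_k, round(raw_var, 1), round(scaled_var, 2))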
</code></code></pre><p>The one part of this people skip past is the division by the square root of <code>d_k</code>. It is not cosmetic. The dot product of two vectors with unit-variance components and dimension <code>d_k</code> has variance <code>d_k</code>. Without the scaling, as you widen the head dimension the raw scores grow, the softmax saturates, it puts almost all of its mass on a single key, and the gradient through it collapses toward zero. The square root puts the variance back near one so the softmax stays in a regime where it can learn. This factor has survived every architecture change since 2017, and it is invisible until you write attention out yourself.</p><h2>Causal masking is now the whole point</h2><p>In the full model the mask was one detail among several. In a decoder-only model it is the defining feature, because it is the only thing distinguishing this block from an encoder block. The rule is that position <code>i</code> may attend to positions up to and including <code>i</code>, never beyond. That is a lower-triangular mask, combined by a logical AND with a padding mask when you batch ragged sequences:</p><pre><code><code>def make_mask(seq, pad_idx):
    pad = (seq != pad_idx).unsqueeze(1)              # (B, 1, seq)
    causal = subsequent_mask(seq.size(1))            # (1, seq, seq), lower-triangular
    return pad &amp; causal                              # (B, seq, seq)
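</code></code></pre><p>The helper <code>subsequent_mask</code> is called above but not shown. One possible definition is sketched below, together with a check that no position can attend to a later one. The helper body and the toy input are assumptions. <code>torch.logical_and</code> is used in place of the <code>&amp;</code> operator and yields the same mask for boolean inputs.</p><pre><code><code>import torch


def subsequent_mask(size):
    # Lower-triangular boolean mask: row i is True for columns 0..i.
    return torch.tril(torch.ones(1, size, size)).bool()


def make_mask(seq, pad_idx):
    pad = (seq != pad_idx).unsqueeze(1)              # (B, 1, seq)
    causal = subsequent_mask(seq.size(1))            # (1, seq, seq), lower-triangular
    return torch.logical_and(pad, causal)            # (B, seq, seq)


seq = torch.tensor([[5, 7, 9, 0]])                   # toy batch; last slot is padding
mask = make_mask(seq, pad_idx=0)
print(mask.shape)                                    # torch.Size([1, 4, 4])
print(mask[0].int())

# The strict upper triangle must be empty, otherwise position i can see i+1.
leak = torch.triu(mask[0].int(), diagonal=1).any().item()
print("future leak:", leak)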
</code></code></pre><p>Get the broadcasting wrong by one axis and the model still runs, still trains, and silently lets position <code>i</code> see token <code>i+1</code>. There is no error, just a loss that looks suspiciously good because the model is reading the answer it is supposed to predict. The only defense is understanding exactly what each dimension means. This is the single bug that the from-scratch exercise is most worth doing to internalize.</p><h2>The details carried over from the paper</h2><p>Three more choices come straight from 2017 and are easy to drop. Embeddings are multiplied by the square root of the model dimension before positions are added, which keeps their magnitude comparable to the position signal. Positions themselves are fixed sinusoids, a bank of sines and cosines at geometrically spaced frequencies, rather than learned parameters. And the LayerNorm sits after the residual add: <code>LayerNorm(x + sublayer(x))</code>, which the paper calls post-norm.</p><p>Hold onto that last one, because post-norm is the decision that aged the worst, and it is the first thing a modern decoder-only model changes.</p><h2>Checking it actually works</h2><p>A from-scratch model that runs is not a from-scratch model that is correct. The full Transformer has a clean smoke test, the copy task, where the encoder reads a sequence and the decoder reproduces it. A decoder-only model has no encoder and no source, so that test does not apply directly. The right analog is the in-context copy task: each training example is a random sequence, a separator token, then the same sequence again. The model is trained on next-token prediction over only the second copy, so to do well it has to look back past the separator and reproduce the first occurrence.</p><p>This is a real capability test, not a trivial one. 
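</p><p>The data side of the in-context copy task is small enough to sketch. The following is an illustration, not the code behind the reported run: the function name <code>copy_task_batch</code>, the separator id <code>SEP</code>, and the vocabulary layout are assumptions.</p><pre><code><code>import torch

SEP = 1  # hypothetical separator id; ids 2 and above are ordinary tokens


def copy_task_batch(batch_size, span_len, vocab_size, seed=0):
    g = torch.Generator().manual_seed(seed)
    span = torch.randint(2, vocab_size, (batch_size, span_len), generator=g)
    sep = torch.full((batch_size, 1), SEP, dtype=torch.long)
    tokens = torch.cat([span, sep, span], dim=1)     # (B, 2*span_len + 1)
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # next-token prediction pairs
    # Score only the second copy: targets from index span_len onward are the repeat.
    loss_mask = torch.zeros_like(targets, dtype=torch.bool)
    loss_mask[:, span_len:] = True
    return inputs, targets, loss_mask


inputs, targets, loss_mask = copy_task_batch(batch_size=4, span_len=6, vocab_size=50)
print(inputs.shape, targets.shape, int(loss_mask.sum()))
</code></code></pre><p>With logits of shape <code>(B, T, V)</code>, one way to apply the mask is <code>F.cross_entropy(logits[loss_mask], targets[loss_mask])</code>, so that only predictions of the repeated span contribute to the loss.</p><p>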
Reproducing an earlier span requires the model to form what the interpretability literature calls an induction head, and that provably needs at least two attention layers, one to find the previous occurrence and one to copy what followed it. A two-layer model learns it in a few hundred steps. Loss falls from about 6 down to roughly zero, and greedy generation reproduces held-out prompts token for token. If your causal mask is wrong, this is again where it surfaces, because a model that can see the future copy will drive its loss to zero through a route that has nothing to do with attention.</p><h2>What changed since 2017, and why decoder-only won</h2><p>With the model built, the interesting part is holding it next to one from this year and listing what moved.</p><p>Normalization moved first. Post-norm, the original choice, is hard to train at depth. Its gradients are badly enough behaved that the paper needed a learning-rate warmup to keep early training from diverging. Pre-norm, <code>x + sublayer(LayerNorm(x))</code>, puts the normalization inside the residual branch instead of across it, which keeps a clean gradient path straight down the stack. It trains deeper and more stably, often without warmup, and it is now the default in essentially every decoder-only model. This is the one place the original is simply worse, and it is a one-line change.</p><p>Positions moved twice. Fixed sinusoids gave way to learned absolute positions, and learned absolute gave way to rotary embeddings, which encode position by rotating the query and key vectors so attention scores depend on relative offset. Rotary extrapolates better to longer contexts and is what most current open models use.</p><p>The reason the decoder-only shape won at all is partly simplicity and how cleanly it scales, but a large part is serving. An autoregressive decoder generates one token per step, and because of causal masking the representation of every earlier token is fixed once computed. 
So you cache the keys and values you have already produced and never recompute them. That KV cache turns generation from quadratic recompute into a linear append, and it makes a single causal stack the natural, efficient unit of generation in a way an encoder-decoder model is not. The architecture that is cheapest to serve is the one that took over.</p><p>That same serving pressure produced the next change to attention itself. The original gives every head its own keys and values. Multi-query and grouped-query attention share keys and values across heads, which shrinks the KV cache by a large factor and is the main reason long-context generation is affordable. That change exists almost entirely for inference, not for quality.</p><p>Smaller deltas round it out. The ReLU in the feed-forward block became GELU and then gated variants like SwiGLU, usually with the inner dimension trimmed to hold the parameter count roughly fixed despite the extra gate. None of these touch the skeleton.</p><h2>Takeaway</h2><p>A decoder-only transformer is the 2017 model reduced to exactly what autoregressive generation needs, plus a handful of patches for stability and serving. The encoder and cross-attention are gone, the self-attention is causally masked, and the spine that remains, scaled dot-product attention over a residual stream with normalization and added positions, is the same one the paper drew eight years ago. The square root that keeps the softmax sane is unchanged.</p><p>That is the argument for building it by hand as a reduction rather than from a template. Once you have removed the encoder yourself, placed the causal mask, and watched the loss collapse on a task that depends on it, every modern architecture diagram stops looking like a new thing and starts looking like this block with the LayerNorm moved and the KV cache shrunk. 
You can read which part is load-bearing and which part is someone optimizing the serving bill.</p>]]></content:encoded></item><item><title><![CDATA[Support Vector Machine Classification of Data Dependency Graphs]]></title><description><![CDATA[I will be presenting this paper at the IEEE NAECON 2025 conference.]]></description><link>https://johnmusgrave.com/p/support-vector-machine-classification</link><guid isPermaLink="false">https://johnmusgrave.com/p/support-vector-machine-classification</guid><dc:creator><![CDATA[John Musgrave]]></dc:creator><pubDate>Wed, 09 Jul 2025 01:10:00 GMT</pubDate><content:encoded><![CDATA[<p><strong>I will be presenting this paper at the IEEE NAECON 2025 conference.</strong></p><p><a href="https://ieeexplore.ieee.org/abstract/document/11235449">https://ieeexplore.ieee.org/abstract/document/11235449</a></p><p>Abstract: Support Vector Machine approaches provide efficient classification in data sets with high dimensionality.  In prior studies we have presented classification results in a high-dimensional metric space constructed from data dependency graphs.  These features were demonstrated to be tied to ground truth class labels, and therefore correlated to operational semantics.  The dimensionality of the metric space is derived from the quantity of isomorphically unique data dependency graphs within the data set.  While successive refinement can reduce the dimensionality and search space, dimensionality at the most coarse-grained level remains very high.  We present results from Support Vector Machine classifiers and show high accuracy with low false positive rates using features correlated to operational semantics.  We train classifiers on the Kaggle 2015 Microsoft Malware data set using a linear SVM classifier (One-vs-Rest and One-vs-One), an SVM classifier with an RBF kernel, an SVM classifier with a polynomial kernel, and an SVM classifier with a custom kernel based on computing the pairwise Hamming distance.
This study obtains a total accuracy of over 93% for a multi-class classification problem using a Linear SVM (One-vs-Rest) classifier.  The Linear SVM classifier has the lowest false positive rate and high precision, with 6 of 9 classes having precision above 90%, high F1 scores, and a ROC AUC of 0.98.  Non-linear SVM kernels show a decrease in total accuracy which indicates linearity in the associated feature space.  The classifier was trained on features correlated with binary operational semantics which are demonstrated to be tied to ground truth class labels.</p>]]></content:encoded></item><item><title><![CDATA[kNN Classification of Malware Data Dependency Graph Features]]></title><description><![CDATA[I will be presenting this paper at the IEEE NAECON 2024 conference.]]></description><link>https://johnmusgrave.com/p/knn-classification-of-malware-data</link><guid isPermaLink="false">https://johnmusgrave.com/p/knn-classification-of-malware-data</guid><dc:creator><![CDATA[John Musgrave]]></dc:creator><pubDate>Tue, 04 Jun 2024 19:36:00 GMT</pubDate><content:encoded><![CDATA[<p><strong>I will be presenting this paper at the IEEE NAECON 2024 conference.</strong></p><p></p><p><a href="https://ieeexplore.ieee.org/document/10670673">https://ieeexplore.ieee.org/document/10670673</a></p><p>Abstract:  &#8220;Explainability of classification results is dependent upon the features used in classification. Data dependency graph features representing data movement are directly correlated with operational semantics, and subject to fine grained analysis. This study obtains accurate classification from the use of features tied to structure and semantics. By training an accurate model using labeled data, this feature representation of semantics is shown to be correlated with ground truth labels. This was performed using non-parametric learning with a novel feature representation on a large scale dataset, the Kaggle 2015 Malware dataset. 
The features used enable fine grained analysis, increase in resolution, and explainable inferences. This allows for the body of the term frequency distribution to be further analyzed and to provide an increase in feature resolution over term frequency features. This method obtains high accuracy from analysis of a single instruction, a method that can be repeated for additional instructions to obtain further increases in accuracy. This study evaluates the hypothesis that the semantic representation and analysis of structure are able to make accurate predictions that are also correlated to ground truth labels. Additionally, similarity in the metric space can be calculated directly without prior training. Our results provide evidence that data dependency graphs accurately capture both semantic and structural information for increased explainability in classification results.&#8221;</p><p></p><pre><code>@INPROCEEDINGS{10670673,
  author={Musgrave, John and Ralescu, Anca},
  booktitle={NAECON 2024 - IEEE National Aerospace and Electronics Conference}, 
  title={kNN Classification of Malware Data Dependency Graph Features}, 
  year={2024},
  volume={},
  number={},
  pages={206-213},
  keywords={Training;Measurement;Accuracy;Semantics;Aerospace electronics;Feature extraction;Malware;machine learning;feature extraction;malware analysis},
  doi={10.1109/NAECON61878.2024.10670673}
}</code></pre>]]></content:encoded></item><item><title><![CDATA[Some recent papers]]></title><description><![CDATA[Search and Retrieval in Semantic-Structural Representations of Novel Malware]]></description><link>https://johnmusgrave.com/p/some-recent-papers</link><guid isPermaLink="false">https://johnmusgrave.com/p/some-recent-papers</guid><dc:creator><![CDATA[John Musgrave]]></dc:creator><pubDate>Wed, 20 Mar 2024 18:00:00 GMT</pubDate><content:encoded><![CDATA[<h2>Search and Retrieval in Semantic-Structural Representations of Novel Malware</h2><p>Abstract:  &#8220;In this study, we present a novel representation for binary programs which captures semantic similarity and structural properties. This representation enables the search and retrieval of binary executable programs based on their similarity of behavioral properties. The proposed representation is composed in a bottom-up approach: we begin by extracting data dependency graphs (DDG), which are representative of both program structure and operational semantics. We then encode each program as a set of graph hashes representing isomorphic uniqueness, a method we have labeled DDG Fingerprinting. We present experimental results of search using k-Nearest Neighbors in a metric space constructed from a set of binary executables. 
Searches in the dataset are based on the operational semantics of specific malware examples.&#8221;</p><p></p><p><a href="http://dx.doi.org/10.54364/aaiml.2024.41117">http://dx.doi.org/10.54364/aaiml.2024.41117</a></p><pre><code>@article{musgrave2024search, title={Search and Retrieval in Semantic-Structural Representations of Novel Malware}, author={Musgrave, John and Campan, Alina and Messay-Kebede, Temesguen and Kapp, David and Wang, Boyang}, journal={Advances in Artificial Intelligence and Machine Learning}, volume={4}, number={1}, pages={117}, year={2024} }</code></pre><p></p><h2>Empirical Network Structure of Malicious Programs</h2><p>Abstract:  &#8220;A modern binary executable is a composition of various types of networks. Control flow graphs are a commonly used representation of an executable program used for classification tasks. Control flow and term frequency representations are widely adopted, but provide only a partial view of program semantics and present challenges to increases in resolution. By performing a quantitative analysis of program networks, we enable the identification of patterns within these features that are correlated to structure. This allows for increases in feature resolution and pattern recognition in classification tasks. These are necessary steps in order to obtain greater explainability in classification results. We demonstrate the presence of Scale-Free properties of network structure for program data dependency and control flow graphs, and show that data dependency graphs also have Small-World structural properties. We show that program data dependency graphs have a degree correlation that is structurally disassortative, and that control flow graphs have a neutral degree assortativity, indicating the use of random graphs to model the structural properties of program control flow graphs would show increased accuracy. 
An increase in feature resolution allows for the structural properties of program classes to be analyzed for patterns as well as their component parts. By providing an increase in feature resolution within labeled datasets of executable programs we provide a quantitative basis to interpret the results of classifiers trained on CFG graph features. By capturing a complete picture of program networks we can enable future work in mapping a program&#8217;s operational semantics to its structure.&#8221;</p><p></p><p><a href="http://dx.doi.org/10.54364/aaiml.2024.41112">http://dx.doi.org/10.54364/aaiml.2024.41112</a></p><pre><code>@article{musgrave2024empirical, title={Empirical Network Structure of Malicious Programs}, author={Musgrave, John and Campan, Alina and Messay-Kebede, Temesguen and Kapp, David and Wang, Boyang}, journal={Advances in Artificial Intelligence and Machine Learning}, volume={4}, number={1}, pages={112}, year={2024} } </code></pre>]]></content:encoded></item></channel></rss>