05 rl intro

import%20marimo%0A%0A__generated_with%20%3D%20%220.19.9%22%0Aapp%20%3D%20marimo.App(width%3D%22medium%22)%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20import%20marimo%20as%20mo%0A%0A%20%20%20%20return%20(mo%2C)%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%20RL%20at%20Scale%3A%20The%20Systems%20Problem%0A%0A%20%20%20%20You've%20learned%20how%20Monarch%20organizes%20distributed%20work%3A%20actors%20communicate%0A%20%20%20%20through%20endpoints%2C%20meshes%20provide%20structured%20parallelism%2C%20and%20supervision%0A%20%20%20%20trees%20keep%20things%20running%20when%20GPUs%20fail.%20Now%20the%20question%3A%20what%20do%20you%0A%20%20%20%20**build**%20with%20all%20of%20this%3F%0A%0A%20%20%20%20Increasingly%2C%20**RL**%20is%20one%20of%20the%20most%20natural%20use%20cases.%20From%20RLHF%20for%0A%20%20%20%20alignment%20to%20reasoning%20capabilities%20(o1-style%20thinking)%20to%20agentic%20tool%0A%20%20%20%20use%2C%20RL%20is%20a%20core%20part%20of%20the%20modern%20LLM%20pipeline.%20And%20RL%20at%20scale%20is%0A%20%20%20%20fundamentally%20a%20distributed%20systems%20problem%20%E2%80%94%20generators%20producing%0A%20%20%20%20rollouts%2C%20trainers%20consuming%20them%2C%20weights%20flowing%20between%20them%20%E2%80%94%20that%0A%20%20%20%20maps%20naturally%20to%20the%20actor%20model.%0A%0A%20%20%20%20This%20notebook%20establishes%20the%20groundwork%3A%0A%0A%20%20%20%201.%20**The%20task**%20%E2%80%94%20identifying%20a%20synthetic%20benchmark%20where%20the%20model%20has%20room%20to%20improve%0A%20%20%20%202.%20**The%20reward**%20%E2%80%94%20fully%20verifiable%2C%20no%20ambiguity%0A%20%20%20%203.%20**The%20bottleneck**%20%E2%80%94%20why%20synchronous%20RL%20wastes%20GPUs%0A%20%20%20%204.%20**The%20architecture**%20%E2%80%94%20async%20RL%2C%20and%20why%20it%20maps%20to%20actors%0A%0A%20%20%20%20But%20first%3A%20**when%20does%20RL%20actually%20need%20distributed%20systems%3F**%20Let's%20build%0A%20%20%20%20intuition.%0A%20%20%20%20%22%22%
22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20import%20os%0A%20%20%20%20os.environ%5B%22CUDA_VISIBLE_DEVICES%22%5D%20%3D%20%220%22%0A%20%20%20%20os.environ%5B%22HF_HUB_OFFLINE%22%5D%20%3D%20%221%22%0A%0A%20%20%20%20import%20torch%0A%20%20%20%20from%20transformers%20import%20AutoModelForCausalLM%2C%20AutoTokenizer%0A%0A%20%20%20%20MODEL_NAME%20%3D%20%22Qwen%2FQwen2.5-0.5B-Instruct%22%0A%0A%20%20%20%20print(f%22Loading%20%7BMODEL_NAME%7D...%22)%0A%20%20%20%20tokenizer%20%3D%20AutoTokenizer.from_pretrained(MODEL_NAME)%0A%0A%20%20%20%20device%20%3D%20%22cuda%22%20if%20torch.cuda.is_available()%20else%20%22cpu%22%0A%20%20%20%20model%20%3D%20AutoModelForCausalLM.from_pretrained(%0A%20%20%20%20%20%20%20%20MODEL_NAME%2C%0A%20%20%20%20%20%20%20%20torch_dtype%3Dtorch.float16%20if%20device%20%3D%3D%20%22cuda%22%20else%20torch.float32%2C%0A%20%20%20%20%20%20%20%20device_map%3D%22auto%22%20if%20device%20%3D%3D%20%22cuda%22%20else%20None%2C%0A%20%20%20%20)%0A%0A%20%20%20%20print(f%22Loaded%20on%20%7Bdevice%7D%20(%7Bsum(p.numel()%20for%20p%20in%20model.parameters())%20%2F%201e6%3A.1f%7DM%20params)%22)%0A%20%20%20%20return%20device%2C%20model%2C%20tokenizer%2C%20torch%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20GSM8K%3A%20A%20Classic%20RL%20Target%0A%0A%20%20%20%20GSM8K%20(grade-school%20math)%20is%20one%20of%20the%20most%20popular%20RL%20benchmarks%20for%0A%20%20%20%20LLMs.%20It's%20a%20great%20RL%20target%20because%20it%20has%20**verifiable%20rewards**%20(we%20can%0A%20%20%20%20check%20the%20numerical%20answer)%20and%20the%20model%20generates%20**reasoning%20traces**%0A%20%20%20%20we%20can%20inspect.%20Let's%20see%20what%20that%20looks%20like%3A%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(device%2C%20model%2C%20tokenizer%2C%20torch)%3A%0A%20%20%20%20%23%20Fix%20seed%20for%20reproducible%20output%0A%20%20%20%20torch.manual_seed(42)%0A%20%20%20%20if%20torch.cuda.is_available()
%3A%0A%20%20%20%20%20%20%20%20torch.cuda.manual_seed_all(42)%0A%0A%20%20%20%20gsm8k_problem%20%3D%20%22%22%22Janet's%20ducks%20lay%2016%20eggs%20per%20day.%20She%20eats%20three%20for%20breakfast%20every%20morning%20and%20bakes%20muffins%20for%20her%20friends%20every%20day%20with%20four.%20She%20sells%20the%20remainder%20at%20the%20farmers'%20market%20daily%20for%20%242%20per%20fresh%20duck%20egg.%20How%20much%20in%20dollars%20does%20she%20make%20every%20day%20at%20the%20farmers'%20market%3F%22%22%22%0A%0A%20%20%20%20messages%20%3D%20%5B%0A%20%20%20%20%20%20%20%20%7B%22role%22%3A%20%22system%22%2C%20%22content%22%3A%20%22You%20are%20a%20helpful%20assistant.%20Solve%20the%20math%20problem%20step%20by%20step%2C%20then%20give%20the%20final%20numerical%20answer.%22%7D%2C%0A%20%20%20%20%20%20%20%20%7B%22role%22%3A%20%22user%22%2C%20%22content%22%3A%20gsm8k_problem%7D%0A%20%20%20%20%5D%0A%0A%20%20%20%20text%20%3D%20tokenizer.apply_chat_template(messages%2C%20tokenize%3DFalse%2C%20add_generation_prompt%3DTrue)%0A%20%20%20%20inputs%20%3D%20tokenizer(text%2C%20return_tensors%3D%22pt%22).to(device)%0A%0A%20%20%20%20with%20torch.no_grad()%3A%0A%20%20%20%20%20%20%20%20outputs%20%3D%20model.generate(%0A%20%20%20%20%20%20%20%20%20%20%20%20**inputs%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20max_new_tokens%3D512%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20temperature%3D0.7%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20do_sample%3DTrue%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20pad_token_id%3Dtokenizer.eos_token_id%2C%0A%20%20%20%20%20%20%20%20)%0A%0A%20%20%20%20response%20%3D%20tokenizer.decode(outputs%5B0%5D%5Binputs%5B%22input_ids%22%5D.shape%5B1%5D%3A%5D%2C%20skip_special_tokens%3DTrue)%0A%0A%20%20%20%20print(%22PROBLEM%3A%22)%0A%20%20%20%20print(gsm8k_problem)%0A%20%20%20%20print(%22%5CnMODEL%20RESPONSE%3A%22)%0A%20%20%20%20print(response)%0A%20%20%20%20print(%22%3D%22*10)%0A%20%20%20%20print(%22%5CnThe%20correct%20answer%20should%20be%3A%20%2418%20%20%E2%86%92%20%20(16%20-%203%20-%204)%20%
C3%97%20%242%20%3D%209%20%C3%97%20%242%20%3D%20%2418%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Even%20at%200.5B%20parameters%2C%20Qwen%202.5%20can%20generate%20a%20convincing-looking%20reasoning%20trace%0A%20%20%20%20%20%E2%80%94%20but%20it%20doesn't%20always%20produce%20the%20right%20answer.%0A%0A%20%20%20%20That's%20actually%20a%20nice%20property%3A%20the%20fact%0A%20%20%20%20that%20the%20model%20*sometimes*%20gets%20answers%20right%20gives%20us%20confidence%20that%0A%20%20%20%20further%20training%20can%20elicit%20more%20of%20this%20behavior.%20If%20it%20never%20succeeded%2C%0A%20%20%20%20RL%20would%20have%20no%20positive%20signal%20to%20reinforce.%0A%0A%20%20%20%20GSM8K%20shows%20the%20basic%20RL%20setup%3A%20the%20model%20generates%20a%20reasoning%20trace%2C%0A%20%20%20%20we%20check%20the%20final%20answer%2C%20and%20we%20get%20a%20binary%20reward.%20This%20is%20the%0A%20%20%20%20pattern%20behind%20DeepSeek-R1%20and%20similar%20work.%0A%0A%20%20%20%20But%20RL%20tasks%20come%20in%20many%20flavors.%20Tool-use%20tasks%20are%20more%20complex%20%E2%80%94%20the%0A%20%20%20%20model%20must%20call%20external%20APIs%2C%20chain%20results%2C%20and%20do%20arithmetic%20on%0A%20%20%20%20retrieved%20values.%20These%20tasks%20also%20produce%20**variable-length%0A%20%20%20%20trajectories**%2C%20which%20will%20matter%20when%20we%20talk%20about%20the%20systems%0A%20%20%20%20architecture.%0A%0A%20%20%20%20For%20this%20tutorial%2C%20we'll%20use%20a%20fully%20synthetic%20tool-use%20benchmark%20where%0A%20%20%20%20we%20control%20everything%20%E2%80%94%20difficulty%2C%20ground%20truth%2C%20and%20latency%20profile.%0A%0A%20%20%20%20Enter%20the%20**Zorplex%20Task**.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20The%20Zorplex%20Task%3A%20A%20Fully%20Synthetic%20Benchmark%0A%0A%20%20%20%20**Zorplex%20values**%20are%20completely%20made-u
p%20numbers%20assigned%20to%20words.%20The%20model%20has%20never%20seen%20them%20before%20and%20cannot%20guess%20them.%0A%0A%20%20%20%20**Why%20synthetic%20benchmarks%20matter%3A**%0A%20%20%20%20-%20Model%20**cannot**%20have%20memorized%20the%20answers%0A%20%20%20%20-%20100%25%20verifiable%20-%20we%20know%20the%20ground%20truth%0A%20%20%20%20-%20Clean%20signal%20for%20RL%20-%20no%20ambiguity%20about%20correctness%0A%20%20%20%20-%20We%20control%20the%20difficulty%0A%0A%20%20%20%20Let's%20import%20our%20benchmark%20library%3A%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20from%20zorplex_rl%20import%20(%0A%20%20%20%20%20%20%20%20TASK_SPECS%2C%0A%20%20%20%20%20%20%20%20get_spec%2C%0A%20%20%20%20%20%20%20%20ZORPLEX_WORDS%2C%0A%20%20%20%20%20%20%20%20make_zorplex_table%2C%0A%20%20%20%20%20%20%20%20generate_with_tools%2C%0A%20%20%20%20%20%20%20%20print_result%2C%0A%20%20%20%20)%0A%0A%20%20%20%20%23%20Show%20the%20secret%20table%0A%20%20%20%20table%20%3D%20make_zorplex_table(seed%3D42)%0A%20%20%20%20print(%22Zorplex%20Values%20(seed%3D42)%3A%22)%0A%20%20%20%20for%20_i%2C%20(_word%2C%20_value)%20in%20enumerate(table.items())%3A%0A%20%20%20%20%20%20%20%20if%20_i%20%3C%208%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20print(f%22%20%20%7B_word%7D%3A%20%7B_value%7D%22)%0A%20%20%20%20print(f%22%20%20...%20(%7Blen(table)%7D%20total%20words)%22)%0A%0A%20%20%20%20print(f%22%5CnRegistered%20task%20specs%3A%20%7Blist(TASK_SPECS.keys())%7D%22)%0A%20%20%20%20return%20TASK_SPECS%2C%20generate_with_tools%2C%20get_spec%2C%20print_result%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%23%20How%20Tool%20Injection%20Works%0A%0A%20%20%20%20The%20%60generate_with_tools%60%20function%20implements%20an%20**agentic%20loop**%20%E2%80%94%20generate%0A%20%20%20%20until%20the%20model%20outputs%20a%20tool%20call%2C%20execute%20it%2C%20inject%20the%20result%2C%20and%0A%20%20%20%20continue%3A%0A%0A%20%20%20%20%60%60%60%0A
%20%20%20%20%E2%94%8C%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%90%0A%20%20%20%20%E2%94%82%20%201.%20Generate%20text%20until%20tool%20call%20detected%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%20%20%20%20%E2%94%94%E2%94%80%20StoppingCriteria%20regex-matches%20LOOKUP%5B...%5D%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%202.%20Parse%20and%20execute%20the%20tool%20call%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%20%20%20%20%E2%94%94%E2%94%80%20LOOKUP%5Bbanana%5D%20%E2%86%92%2042%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%203.%20Inject%20result%20into%20context%3A%20%22%5BResult%3A%2042%5D%22%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%
20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%82%20%204.%20Repeat%20until%20no%20tool%20calls%20or%20max%20turns%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%E2%94%82%0A%20%20%20%20%E2%94%94%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%80%E2%94%98%0A%20%20%20%20%60%60%60%0A%0A%20%20%20%20The%20model%20sees%20tool-use%20examples%20in%20the%20system%20prompt%2C%20learns%20the%20format%2C%0A%20%20%20%20and%20gets%20results%20injected%20back.%20Let's%20see%20what%20each%20difficulty%20level%0A%20%20%20%20asks%20it%20to%20do%3A%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20Task%20Difficulty%20Levels%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(get_spec)%3A%0A%20%20%20%20%23%20Show%20each%20task's%20system%20prompt%0A%20%20%20%20for%20_task_name%20in%20%5B%22simple%22%2C%20%22compositional%22%2C%20%22multi_step%22%2C%20%22recursive%22%5D%3A%0A%20%20%20%20%20%20%20%20_spec%20%3D%20get_spec(_task_name%2C%20seed%3D42)%0A%20%20%20%20%20%20%20%20_task%20%3D%20_spec.generate_task()%0A%0A%20%20%20%20%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20%20%20%20%20print(f%22TASK%3A%20%7B_task_name%7D%20-%20%7B_spec.description%7D%22)%0A%20%20%20%20%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20%20%20%20%20print(f%22%5CnExample%20question%3A%20%7B_task.question%7D%22)%0A%20%20%20%20%20%20%20%20print(f%22Correct%20answer%3A%20%7B_task.correct_a
nswer%7D%22)%0A%20%20%20%20%20%20%20%20if%20_task.metadata%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20print(f%22Metadata%3A%20%7B_task.metadata%7D%22)%0A%20%20%20%20%20%20%20%20print(f%22%5CnSystem%20prompt%3A%5Cn%7B_spec.get_system_prompt(with_hint%3DTrue)%7D%22)%0A%20%20%20%20%20%20%20%20print()%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20Running%20the%20Benchmark%0A%0A%20%20%20%20Let's%20evaluate%20the%20model%20on%20all%20task%20types.%20This%20uses%20proper%20tool%20injection%20-%20the%20model%20generates%20until%20it%20outputs%20a%20tool%20call%2C%20we%20execute%20the%20tool%20and%20inject%20the%20result%2C%20then%20continue.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(%0A%20%20%20%20TASK_SPECS%2C%0A%20%20%20%20device%2C%0A%20%20%20%20generate_with_tools%2C%0A%20%20%20%20get_spec%2C%0A%20%20%20%20model%2C%0A%20%20%20%20tokenizer%2C%0A%20%20%20%20torch%2C%0A)%3A%0A%20%20%20%20import%20time%0A%20%20%20%20import%20random%0A%0A%20%20%20%20%23%20Run%20benchmark%20on%20all%20tasks%0A%20%20%20%20NUM_SAMPLES%20%3D%2010%0A%20%20%20%20SEED%20%3D%2042%0A%0A%20%20%20%20%23%20Set%20seeds%20for%20reproducibility%0A%20%20%20%20random.seed(SEED)%0A%20%20%20%20torch.manual_seed(SEED)%0A%20%20%20%20if%20torch.cuda.is_available()%3A%0A%20%20%20%20%20%20%20%20torch.cuda.manual_seed_all(SEED)%0A%0A%20%20%20%20benchmark_results%20%3D%20%7B%7D%0A%20%20%20%20all_latencies%20%3D%20%5B%5D%20%20%23%20Track%20latencies%20across%20all%20tasks%0A%0A%20%20%20%20for%20bench_task_name%20in%20TASK_SPECS.keys()%3A%0A%20%20%20%20%20%20%20%20bench_spec%20%3D%20get_spec(bench_task_name%2C%20seed%3DSEED)%0A%20%20%20%20%20%20%20%20results%20%3D%20%5B%5D%0A%20%20%20%20%20%20%20%20latencies%20%3D%20%5B%5D%0A%0A%20%20%20%20%20%20%20%20print(f%22Running%20%7Bbench_task_name%7D...%22%2C%20end%3D%22%20%22%2C%20flush%3DTrue)%0A%20%20%20%20%20%20%20%20for%20_%20in%20range(NUM_SAMPLES)%3A%0A%20%20%20%20%20%20%20
%20%20%20%20%20bench_task%20%3D%20bench_spec.generate_task()%0A%0A%20%20%20%20%20%20%20%20%20%20%20%20start_time%20%3D%20time.perf_counter()%0A%20%20%20%20%20%20%20%20%20%20%20%20result%20%3D%20generate_with_tools(%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20model%2C%20tokenizer%2C%20bench_spec%2C%20bench_task%2C%20device%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20max_turns%3D5%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20)%0A%20%20%20%20%20%20%20%20%20%20%20%20elapsed_ms%20%3D%20(time.perf_counter()%20-%20start_time)%20*%201000%0A%0A%20%20%20%20%20%20%20%20%20%20%20%20results.append(result)%0A%20%20%20%20%20%20%20%20%20%20%20%20latencies.append(elapsed_ms)%0A%20%20%20%20%20%20%20%20%20%20%20%20all_latencies.append(%7B%22task%22%3A%20bench_task_name%2C%20%22latency_ms%22%3A%20elapsed_ms%7D)%0A%0A%20%20%20%20%20%20%20%20num_correct%20%3D%20sum(1%20for%20r%20in%20results%20if%20r.is_correct)%0A%20%20%20%20%20%20%20%20print(f%22%7Bnum_correct%7D%2F%7BNUM_SAMPLES%7D%20correct%22)%0A%20%20%20%20%20%20%20%20avg_turns%20%3D%20sum(len(r.turns)%20for%20r%20in%20results)%20%2F%20len(results)%0A%20%20%20%20%20%20%20%20avg_tools%20%3D%20sum(r.total_tool_calls%20for%20r%20in%20results)%20%2F%20len(results)%0A%0A%20%20%20%20%20%20%20%20benchmark_results%5Bbench_task_name%5D%20%3D%20%7B%0A%20%20%20%20%20%20%20%20%20%20%20%20%22results%22%3A%20results%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22latencies%22%3A%20latencies%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22accuracy%22%3A%20num_correct%20%2F%20NUM_SAMPLES%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22correct%22%3A%20num_correct%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22total%22%3A%20NUM_SAMPLES%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22avg_turns%22%3A%20avg_turns%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22avg_tools%22%3A%20avg_tools%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22avg_latency_ms%22%3A%20sum(latencies)%20%2F%20len(latencies)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22min_latency_ms%22%3A%20min(lat
encies)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22max_latency_ms%22%3A%20max(latencies)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%22description%22%3A%20bench_spec.description%2C%0A%20%20%20%20%20%20%20%20%7D%0A%0A%20%20%20%20%23%20Print%20summary%20table%0A%20%20%20%20print(%22%3D%22%20*%2060)%0A%20%20%20%20print(%22ZORPLEX%20BENCHMARK%20RESULTS%22)%0A%20%20%20%20print(%22%3D%22%20*%2060)%0A%20%20%20%20print(f%22%7B'Task'%3A%3C15%7D%20%7B'Description'%3A%3C25%7D%20%7B'Accuracy'%3A%3E12%7D%22)%0A%20%20%20%20print(%22-%22%20*%2060)%0A%0A%20%20%20%20for%20bench_task_name%2C%20_data%20in%20benchmark_results.items()%3A%0A%20%20%20%20%20%20%20%20print(%0A%20%20%20%20%20%20%20%20%20%20%20%20f%22%7Bbench_task_name%3A%3C15%7D%20%22%0A%20%20%20%20%20%20%20%20%20%20%20%20f%22%7B_data%5B'description'%5D%3A%3C25%7D%20%22%0A%20%20%20%20%20%20%20%20%20%20%20%20f%22%7B_data%5B'correct'%5D%3A%3E3%7D%2F%7B_data%5B'total'%5D%3A%3C3%7D%20(%7B_data%5B'accuracy'%5D*100%3A%3E3.0f%7D%25)%22%0A%20%20%20%20%20%20%20%20)%0A%0A%20%20%20%20print(%22-%22%20*%2060)%0A%20%20%20%20total_correct%20%3D%20sum(d%5B%22correct%22%5D%20for%20d%20in%20benchmark_results.values())%0A%20%20%20%20total_samples%20%3D%20sum(d%5B%22total%22%5D%20for%20d%20in%20benchmark_results.values())%0A%20%20%20%20print(f%22%7B'OVERALL'%3A%3C15%7D%20%7B''%3A%3C25%7D%20%7Btotal_correct%3A%3E3%7D%2F%7Btotal_samples%3A%3C3%7D%20(%7Btotal_correct%2Ftotal_samples*100%3A%3E3.0f%7D%25)%22)%0A%20%20%20%20print(%22%3D%22%20*%2060)%0A%20%20%20%20return%20all_latencies%2C%20benchmark_results%0A%0A%0A%40app.cell%0Adef%20_(benchmark_results%2C%20mo)%3A%0A%20%20%20%20%23%20Build%20markdown%20table%20from%20results%0A%20%20%20%20rows%20%3D%20%5B%5D%0A%20%20%20%20for%20_name%2C%20_data%20in%20benchmark_results.items()%3A%0A%20%20%20%20%20%20%20%20acc_str%20%3D%20f%22%7B_data%5B'correct'%5D%7D%2F%7B_data%5B'total'%5D%7D%20(%7B_data%5B'accuracy'%5D*100%3A.0f%7D%25)%22%0A%20%20%20%20%20%20%20%20rows.append(f%22%7C%20**%7B_name%7D**%20%7C%20%7
B_data%5B'description'%5D%7D%20%7C%20%7Bacc_str%7D%20%7C%22)%0A%0A%20%20%20%20table_md%20%3D%20%22%5Cn%22.join(rows)%0A%0A%20%20%20%20mo.md(f%22%22%22%0A%20%20%20%20%23%23%20Results%20Analysis%0A%0A%20%20%20%20%7C%20Task%20%7C%20Description%20%7C%20Accuracy%20%7C%0A%20%20%20%20%7C------%7C-------------%7C----------%7C%0A%20%20%20%20%7Btable_md%7D%0A%0A%20%20%20%20Accuracy%20is%20lower%20than%20you%20might%20expect%20%E2%80%94%20and%20that's%20largely%20because%20of%20the%0A%20%20%20%20%60%5BANSWER%5D%60%20format%20requirement.%20The%20model%20can%20often%20get%20the%20*right%20value*%20but%0A%20%20%20%20fails%20to%20emit%20it%20in%20the%20required%20format.%20This%20is%20actually%20good%20news%20for%20RL%3A%0A%20%20%20%20the%20capability%20is%20there%2C%20it%20just%20needs%20to%20be%20reinforced.%0A%0A%20%20%20%20Let's%20look%20at%20concrete%20trajectories%20to%20understand%20the%20failure%20modes.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(benchmark_results%2C%20print_result)%3A%0A%20%20%20%20import%20re%20as%20_re%0A%0A%20%20%20%20%23%20Classify%20results%20by%20failure%20mode%0A%20%20%20%20successes%20%3D%20%5B%5D%0A%20%20%20%20format_failures%20%3D%20%5B%5D%20%20%23%20Right%20value%2C%20wrong%20format%0A%20%20%20%20tool_spam%20%3D%20%5B%5D%20%20%20%20%20%20%20%20%23%20Excessive%20tool%20calls%20without%20progress%0A%20%20%20%20wrong_answer%20%3D%20%5B%5D%20%20%20%20%20%23%20Wrong%20computation%20or%20value%0A%0A%20%20%20%20for%20_task_name%2C%20_data%20in%20benchmark_results.items()%3A%0A%20%20%20%20%20%20%20%20for%20_r%20in%20_data%5B%22results%22%5D%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20if%20_r.is_correct%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20successes.append((_task_name%2C%20_r))%0A%20%20%20%20%20%20%20%20%20%20%20%20else%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Check%20if%20the%20correct%20answer%20appears%20in%20the%20trajectory%20but%20wasn't%20extracted%0A%20%20%20%20%20%20%20%20%20%20%20%20%2
0%20%20%20correct_str%20%3D%20str(_r.task.correct_answer)%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20has_correct_value%20%3D%20correct_str%20in%20_r.final_text%0A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Check%20for%20%5BANSWER%5D%20tag%20presence%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20has_answer_tag%20%3D%20bool(_re.search(r'%5C%5BANSWER%5C%5D'%2C%20_r.final_text))%0A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20if%20not%20has_answer_tag%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20format_failures.append((_task_name%2C%20_r))%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20elif%20_r.total_tool_calls%20%3E%203%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20tool_spam.append((_task_name%2C%20_r))%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20else%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20wrong_answer.append((_task_name%2C%20_r))%0A%0A%20%20%20%20%23%20---%20Successes%20---%0A%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20print(f%22SUCCESS%20EXAMPLES%20(%7Blen(successes)%7D%20total)%22)%0A%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20for%20_task_name%2C%20_r%20in%20successes%5B%3A2%5D%3A%0A%20%20%20%20%20%20%20%20print(f%22%5Cn---%20%7B_task_name%7D%20---%22)%0A%20%20%20%20%20%20%20%20print_result(_r)%0A%0A%20%20%20%20%23%20---%20Format%20failures%20---%0A%20%20%20%20print(%22%5Cn%22%20%2B%20%22%3D%22%20*%2070)%0A%20%20%20%20print(f%22FAILURE%20MODE%201%3A%20Wrong%20format%20(%7Blen(format_failures)%7D%20total)%22)%0A%20%20%20%20print(%22The%20model%20never%20emitted%20%5BANSWER%5D.%22)%0A%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20for%20_task_name%2C%20_r%20in%20format_failures%5B%3A2%5D%3A%0A%20%20%20%20%20%20%20%20print(f%22%5Cn---%20%7B_task_name%7D%20---%22)%0A%20%20%20%20%20%20%20%20print_result(_r)%0A%0A%20%20%20%20%23%20---%20Tool%20spam%20---%0A%20%20%20%20if%20tool_spam%3A%0A%20%20%20%20%20%20%20%20print(%22%5Cn%22%20%2B%20%22%3D%22%20*%
2070)%0A%20%20%20%20%20%20%20%20print(f%22FAILURE%20MODE%202%3A%20Tool%20spam%20(%7Blen(tool_spam)%7D%20total)%22)%0A%20%20%20%20%20%20%20%20print(%22Excessive%20tool%20calls%20without%20making%20progress.%22)%0A%20%20%20%20%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20%20%20%20%20for%20_task_name%2C%20_r%20in%20tool_spam%5B%3A2%5D%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20print(f%22%5Cn---%20%7B_task_name%7D%20---%22)%0A%20%20%20%20%20%20%20%20%20%20%20%20print_result(_r)%0A%0A%20%20%20%20%23%20---%20Wrong%20answer%20---%0A%20%20%20%20if%20wrong_answer%3A%0A%20%20%20%20%20%20%20%20print(%22%5Cn%22%20%2B%20%22%3D%22%20*%2070)%0A%20%20%20%20%20%20%20%20print(f%22FAILURE%20MODE%203%3A%20Wrong%20answer%20(%7Blen(wrong_answer)%7D%20total)%22)%0A%20%20%20%20%20%20%20%20print(%22Incorrect%20computation%20or%20value%20extraction.%22)%0A%20%20%20%20%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20%20%20%20%20for%20_task_name%2C%20_r%20in%20wrong_answer%5B%3A2%5D%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20print(f%22%5Cn---%20%7B_task_name%7D%20---%22)%0A%20%20%20%20%20%20%20%20%20%20%20%20print_result(_r)%0A%0A%20%20%20%20print(%22%5Cn%22%20%2B%20%22%3D%22%20*%2070)%0A%20%20%20%20print(%22SUMMARY%22)%0A%20%20%20%20print(%22%3D%22%20*%2070)%0A%20%20%20%20print(f%22%20%20Successes%3A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7Blen(successes)%7D%22)%0A%20%20%20%20print(f%22%20%20Right%20value%2C%20bad%20format%3A%20%7Blen(format_failures)%7D%22)%0A%20%20%20%20print(f%22%20%20Tool%20spam%3A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%7Blen(tool_spam)%7D%22)%0A%20%20%20%20print(f%22%20%20Wrong%20answer%3A%20%20%20%20%20%20%20%20%20%20%20%7Blen(wrong_answer)%7D%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20From%20Failure%20Modes%20to%20RL%20Rewards%0A%0A%20%20%20%20The%20trajectories%20above%20reveal%20three%20distinct%20failure%20modes%2C%20and%20each%20maps%0A%20%20%20%20to%20a%20reward%20signal%3A%0A%0A%2
0%20%20%20%7C%20Failure%20Mode%20%7C%20What%20Happens%20%7C%20RL%20Signal%20%7C%0A%20%20%20%20%7C---%7C---%7C---%7C%0A%20%20%20%20%7C%20**Wrong%20format**%20%7C%20Model%20doesn't%20emit%20%60%5BANSWER%5D%60%20%7C%20Reward%20format%20compliance%20%7C%0A%20%20%20%20%7C%20**Tool%20spam**%20%7C%20Excessive%20LOOKUPs%20without%20progress%20%7C%20Penalize%20unnecessary%20tool%20calls%20%7C%0A%20%20%20%20%7C%20**Wrong%20answer**%20%7C%20Incorrect%20arithmetic%20or%20value%20extraction%20%7C%20Binary%20correctness%20reward%20(0%20or%201)%20%7C%0A%0A%20%20%20%20The%20key%20insight%3A%20**the%20model%20already%20has%20most%20of%20the%20capability**.%20It%20can%0A%20%20%20%20call%20tools%2C%20follow%20redirects%2C%20and%20often%20arrives%20at%20the%20right%20value.%20What%0A%20%20%20%20it%20lacks%20is%20the%20discipline%20to%20format%20answers%20correctly%20and%20compose%20results%0A%20%20%20%20reliably.%20These%20are%20exactly%20the%20behaviors%20RL%20can%20reinforce.%0A%0A%20%20%20%20Now%20let's%20talk%20about%20how%20to%20build%20the%20training%20loop...%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20From%20Task%20to%20Training%20Loop%0A%0A%20%20%20%20Unlike%20supervised%20fine-tuning%20(%22here's%20the%20right%20answer%2C%20copy%20it%22)%2C%20RL%0A%20%20%20%20learns%20from%20**exploration**%20%E2%80%94%20the%20model%20tries%20things%20and%20gets%20a%20reward%0A%20%20%20%20signal.%20For%20Zorplex%2C%20this%20means%20it%20can%20discover%20on%20its%20own%20that%20guessing%0A%20%20%20%20values%20gives%20reward%200%2C%20while%20calling%20LOOKUP%20gives%20reward%201.%20No%20one%20needs%0A%20%20%20%20to%20show%20it%20the%20right%20trajectory.%0A%0A%20%20%20%20**Note%20on%20reward%20design%3A**%20The%20binary%20reward%20(correct%3D1%2C%20incorrect%3D0)%20is%0A%20%20%20%20simple%20but%20sparse.%20In%20practice%2C%20you%20might%20add%20**reward%20shaping**%20%E2%80%94%0A%20%20%20%20penalties%20
for%20excessive%20tool%20calls%20(compositional%20tasks%20tend%20to%20spam%0A%20%20%20%20LOOKUP)%2C%20bonuses%20for%20correct%20output%20formatting.%20We'll%20explore%20shaped%0A%20%20%20%20rewards%20in%20NB08%20when%20we%20wire%20up%20the%20full%20RL%20loop.%0A%0A%20%20%20%20The%20standard%20**synchronous%20RL%20loop**%20looks%20like%20this%3A%0A%0A%20%20%20%20%60%60%60python%0A%20%20%20%20%23%20Each%20step%3A%20generate%20on%20all%20GPUs%2C%20wait%2C%20then%20train%0A%20%20%20%20while%20training%3A%0A%20%20%20%20%20%20%20%20trajectories%20%3D%20generators.generate(prompts)%20%20%23%20SLOW%20%E2%80%94%20dominates%20runtime%0A%20%20%20%20%20%20%20%20rewards%20%3D%20evaluate(trajectories)%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20fast%20%E2%80%94%20check%20answer%0A%20%20%20%20%20%20%20%20loss%20%3D%20rl_loss(trajectories%2C%20rewards)%20%20%20%20%20%20%20%20%20%20%23%20policy%20gradient%20update%0A%20%20%20%20%20%20%20%20loss.backward()%0A%20%20%20%20%20%20%20%20optimizer.step()%0A%20%20%20%20%60%60%60%0A%0A%20%20%20%20Each%20generator%20produces%20one%20trajectory%20per%20step.%20We%20wait%20for%20**all**%20of%0A%20%20%20%20them%20to%20finish%20before%20training.%20Spot%20the%20problem%3F%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(all_latencies%2C%20mo)%3A%0A%20%20%20%20%23%20Calculate%20latency%20stats%20for%20the%20narrative%0A%20%20%20%20lats_%20%3D%20%5Bl%5B%22latency_ms%22%5D%20for%20l%20in%20all_latencies%5D%0A%20%20%20%20lat_min%2C%20lat_max%2C%20lat_avg%20%3D%20min(lats_)%2C%20max(lats_)%2C%20sum(lats_)%20%2F%20len(lats_)%0A%20%20%20%20lat_range%20%3D%20lat_max%20%2F%20lat_min%0A%0A%20%20%20%20mo.md(f%22%22%22%0A%20%20%20%20%23%23%20The%20Latency%20Variance%20Problem%0A%0A%20%20%20%20Look%20at%20the%20latencies%20from%20our%20benchmark%3A%0A%0A%20%20%20%20-%20**Min**%3A%20%7Blat_min%3A.0f%7Dms%0A%20%20%20%20-%20**Avg**%3A%20%7Blat_avg%3A.0f%7Dms%0A%20%20%20%20-%20**Max**%3A%20%7Blat_max%3A.0f%7Dms%0A%20%20%20%20-%20**Ra
tio**%3A%20%7Blat_range%3A.1f%7Dx%20spread%20from%20fastest%20to%20slowest%0A%0A%20%20%20%20In%20a%20real%20RL%20loop%2C%20you're%20mixing%20tasks%20of%20varying%20difficulty.%20Some%0A%20%20%20%20trajectories%20finish%20fast%2C%20others%20take%20multiple%20tool%20calls%20and%20hit%20max%0A%20%20%20%20turns.%20In%20the%20sync%20loop%2C%20the%20trainer%20sits%20idle%20during%20generation%2C%20and%0A%20%20%20%20generators%20sit%20idle%20during%20training%20%E2%80%94%20the%20slowest%20trajectory%20sets%20the%0A%20%20%20%20pace%20for%20the%20entire%20batch%3A%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20def%20make_sync_plot()%3A%0A%20%20%20%20%20%20%20%20import%20matplotlib.pyplot%20as%20plt%0A%20%20%20%20%20%20%20%20import%20matplotlib.patches%20as%20mpatches%0A%0A%20%20%20%20%20%20%20%20%23%20Sync%20RL%20visualization%20-%20showing%20straggler%20problem%0A%20%20%20%20%20%20%20%20fig%2C%20ax%20%3D%20plt.subplots(figsize%3D(10%2C%203))%0A%0A%20%20%20%20%20%20%20%20%23%20Data%3A%20(row%2C%20start%2C%20duration%2C%20color)%0A%20%20%20%20%20%20%20%20data%20%3D%20%5B%0A%20%20%20%20%20%20%20%20%20%20%20%20(0%2C%200%2C%203%2C%20%22%2366c2a5%22)%2C%20%20%20%23%20Gen%201%3A%20fast%0A%20%20%20%20%20%20%20%20%20%20%20%20(1%2C%200%2C%205%2C%20%22%2366c2a5%22)%2C%20%20%20%23%20Gen%202%3A%20medium%0A%20%20%20%20%20%20%20%20%20%20%20%20(2%2C%200%2C%203%2C%20%22%2366c2a5%22)%2C%20%20%20%23%20Gen%203%3A%20fast%0A%20%20%20%20%20%20%20%20%20%20%20%20(3%2C%200%2C%2010%2C%20%22%23fc8d62%22)%2C%20%23%20Gen%204%3A%20straggler%20(different%20color)%0A%20%20%20%20%20%20%20%20%20%20%20%20(4%2C%200%2C%2010%2C%20%22%23d3d3d3%22)%2C%20%23%20Trainer%3A%20idle%0A%20%20%20%20%20%20%20%20%20%20%20%20(4%2C%2010%2C%203%2C%20%22%23e78ac3%22)%2C%20%23%20Trainer%3A%20train%0A%20%20%20%20%20%20%20%20%5D%0A%0A%20%20%20%20%20%20%20%20for%20row%2C%20start%2C%20duration%2C%20color%20in%20data%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20ax.barh(row%2C%20duration%2
C%20left%3Dstart%2C%20height%3D0.6%2C%20color%3Dcolor%2C%20edgecolor%3D%22black%22%2C%20linewidth%3D0.5)%0A%0A%20%20%20%20%20%20%20%20%23%20Add%20%22idle%22%20and%20%22train%22%20labels%0A%20%20%20%20%20%20%20%20ax.text(5%2C%204%2C%20%22idle%22%2C%20ha%3D%22center%22%2C%20va%3D%22center%22%2C%20fontsize%3D9%2C%20color%3D%22gray%22)%0A%20%20%20%20%20%20%20%20ax.text(11.5%2C%204%2C%20%22train%22%2C%20ha%3D%22center%22%2C%20va%3D%22center%22%2C%20fontsize%3D9%2C%20color%3D%22white%22%2C%20fontweight%3D%22bold%22)%0A%0A%20%20%20%20%20%20%20%20ax.set_yticks(range(5))%0A%20%20%20%20%20%20%20%20ax.set_yticklabels(%5B%22Gen%201%22%2C%20%22Gen%202%22%2C%20%22Gen%203%22%2C%20%22Gen%204%22%2C%20%22Trainer%22%5D)%0A%20%20%20%20%20%20%20%20ax.set_xlabel(%22Time%20(ms)%22)%0A%20%20%20%20%20%20%20%20ax.set_title(%22Sync%20RL%20-%20Waiting%20for%20Stragglers%22)%0A%20%20%20%20%20%20%20%20ax.grid(axis%3D%22x%22%2C%20alpha%3D0.3)%0A%20%20%20%20%20%20%20%20ax.invert_yaxis()%0A%0A%20%20%20%20%20%20%20%20%23%20Legend%0A%20%20%20%20%20%20%20%20patches%20%3D%20%5B%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%2366c2a5%22%2C%20label%3D%22Fast%2FMedium%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%23fc8d62%22%2C%20label%3D%22Straggler%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%23d3d3d3%22%2C%20label%3D%22Idle%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%23e78ac3%22%2C%20label%3D%22Train%22)%2C%0A%20%20%20%20%20%20%20%20%5D%0A%20%20%20%20%20%20%20%20ax.legend(handles%3Dpatches%2C%20loc%3D%22upper%20right%22%2C%20fontsize%3D8)%0A%0A%20%20%20%20%20%20%20%20plt.tight_layout()%0A%20%20%20%20%20%20%20%20return%20fig%0A%0A%20%20%20%20make_sync_plot()%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20The%20**long-tail%20stragglers**%20set%20the%20pace.%20At%20scale%2C%20this%20kills%20GPU%20utilization.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20retur
n%0A%0A%0A%40app.cell%0Adef%20_(all_latencies)%3A%0A%20%20%20%20def%20make_kde_plot(latencies)%3A%0A%20%20%20%20%20%20%20%20import%20matplotlib.pyplot%20as%20plt%0A%20%20%20%20%20%20%20%20import%20numpy%20as%20np%0A%20%20%20%20%20%20%20%20from%20scipy%20import%20stats%0A%0A%20%20%20%20%20%20%20%20%23%20Group%20latencies%20by%20task%0A%20%20%20%20%20%20%20%20task_latencies%20%3D%20%7B%7D%0A%20%20%20%20%20%20%20%20for%20entry%20in%20latencies%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20task%20%3D%20entry%5B%22task%22%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20if%20task%20not%20in%20task_latencies%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20task_latencies%5Btask%5D%20%3D%20%5B%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20task_latencies%5Btask%5D.append(entry%5B%22latency_ms%22%5D)%0A%0A%20%20%20%20%20%20%20%20%23%20Create%20overlaid%20KDE%20plot%0A%20%20%20%20%20%20%20%20fig%2C%20ax%20%3D%20plt.subplots(figsize%3D(10%2C%204))%0A%0A%20%20%20%20%20%20%20%20colors%20%3D%20%5B%22%231f77b4%22%2C%20%22%23ff7f0e%22%2C%20%22%232ca02c%22%2C%20%22%23d62728%22%5D%0A%20%20%20%20%20%20%20%20all_lats%20%3D%20%5Bl%5B%22latency_ms%22%5D%20for%20l%20in%20latencies%5D%0A%20%20%20%20%20%20%20%20x_range%20%3D%20np.linspace(min(all_lats)%20*%200.8%2C%20max(all_lats)%20*%201.1%2C%20200)%0A%0A%20%20%20%20%20%20%20%20for%20i%2C%20(task%2C%20lats)%20in%20enumerate(task_latencies.items())%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Compute%20KDE%0A%20%20%20%20%20%20%20%20%20%20%20%20kde%20%3D%20stats.gaussian_kde(lats)%0A%20%20%20%20%20%20%20%20%20%20%20%20density%20%3D%20kde(x_range)%0A%20%20%20%20%20%20%20%20%20%20%20%20ax.fill_between(x_range%2C%20density%2C%20alpha%3D0.3%2C%20color%3Dcolors%5Bi%5D%2C%20label%3Dtask)%0A%20%20%20%20%20%20%20%20%20%20%20%20ax.plot(x_range%2C%20density%2C%20color%3Dcolors%5Bi%5D%2C%20linewidth%3D2)%0A%0A%20%20%20%20%20%20%20%20ax.set_xlabel(%22Latency%20(ms)%22%2C%20fontsize%3D12)%0A%20%20%20%20%20%20%20%20ax.set_ylabel(%22Density%22%2C%20fontsize%3D12)%0A%2
0%20%20%20%20%20%20%20ax.set_title(%22Trajectory%20Generation%20Latency%20Distribution%20by%20Task%20Type%22%2C%20fontsize%3D14)%0A%20%20%20%20%20%20%20%20ax.axvline(x%3Dsum(all_lats)%20%2F%20len(all_lats)%2C%20color%3D%22red%22%2C%20linestyle%3D%22--%22%2C%20alpha%3D0.7%2C%20label%3D%22mean%22)%0A%20%20%20%20%20%20%20%20ax.legend(loc%3D%22upper%20right%22)%0A%20%20%20%20%20%20%20%20ax.grid(axis%3D%22x%22%2C%20alpha%3D0.3)%0A%0A%20%20%20%20%20%20%20%20plt.tight_layout()%0A%20%20%20%20%20%20%20%20return%20fig%0A%0A%20%20%20%20make_kde_plot(all_latencies)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20The%20Async%20Solution%20(and%20Its%20Trade-off)%0A%0A%20%20%20%20The%20fix%3A%20**decouple%20generation%20from%20training**%20with%20a%20replay%20buffer.%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_()%3A%0A%20%20%20%20def%20make_async_plot()%3A%0A%20%20%20%20%20%20%20%20import%20matplotlib.pyplot%20as%20plt%0A%20%20%20%20%20%20%20%20import%20matplotlib.patches%20as%20mpatches%0A%0A%20%20%20%20%20%20%20%20%23%20Async%20RL%20visualization%20-%20showing%20continuous%20utilization%0A%20%20%20%20%20%20%20%20fig%2C%20ax%20%3D%20plt.subplots(figsize%3D(10%2C%203.5))%0A%0A%20%20%20%20%20%20%20%20%23%20Colors%20for%20different%20trajectory%20lengths%0A%20%20%20%20%20%20%20%20color_map%20%3D%20%7B%22short%22%3A%20%22%2366c2a5%22%2C%20%22medium%22%3A%20%22%238da0cb%22%2C%20%22long%22%3A%20%22%23fc8d62%22%7D%0A%0A%20%20%20%20%20%20%20%20%23%20Data%3A%20(row%2C%20start%2C%20duration%2C%20type)%0A%20%20%20%20%20%20%20%20data%20%3D%20%5B%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Gen%201%3A%20short%2C%20medium%2C%20short%0A%20%20%20%20%20%20%20%20%20%20%20%20(0%2C%200%2C%202%2C%20%22short%22)%2C%20(0%2C%202%2C%204%2C%20%22medium%22)%2C%20(0%2C%206%2C%202%2C%20%22short%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Gen%202%3A%20medium%2C%20short%2C%20long%0A%20%20%20%20%20%20%20%20%20%2
0%20%20(1%2C%200%2C%204%2C%20%22medium%22)%2C%20(1%2C%204%2C%202%2C%20%22short%22)%2C%20(1%2C%206%2C%206%2C%20%22long%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Gen%203%3A%20long%2C%20short%2C%20medium%0A%20%20%20%20%20%20%20%20%20%20%20%20(2%2C%200%2C%205%2C%20%22long%22)%2C%20(2%2C%205%2C%202%2C%20%22short%22)%2C%20(2%2C%207%2C%204%2C%20%22medium%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Gen%204%3A%20short%2C%20long%2C%20short%0A%20%20%20%20%20%20%20%20%20%20%20%20(3%2C%200%2C%202%2C%20%22short%22)%2C%20(3%2C%202%2C%206%2C%20%22long%22)%2C%20(3%2C%208%2C%202%2C%20%22short%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Buffer%20(visual%20indicator)%0A%20%20%20%20%20%20%20%20%20%20%20%20(4%2C%200%2C%2012%2C%20%22%23e0e0e0%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Trainer%3A%20continuous%20training%0A%20%20%20%20%20%20%20%20%20%20%20%20(5%2C%200%2C%203%2C%20%22%23e78ac3%22)%2C%20(5%2C%203%2C%203%2C%20%22%23e78ac3%22)%2C%20(5%2C%206%2C%203%2C%20%22%23e78ac3%22)%2C%20(5%2C%209%2C%203%2C%20%22%23e78ac3%22)%2C%0A%20%20%20%20%20%20%20%20%5D%0A%0A%20%20%20%20%20%20%20%20for%20item%20in%20data%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20row%2C%20start%2C%20duration%20%3D%20item%5B0%5D%2C%20item%5B1%5D%2C%20item%5B2%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20if%20isinstance(item%5B3%5D%2C%20str)%20and%20item%5B3%5D%20in%20color_map%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20color%20%3D%20color_map%5Bitem%5B3%5D%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20else%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20color%20%3D%20item%5B3%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20ax.barh(row%2C%20duration%2C%20left%3Dstart%2C%20height%3D0.6%2C%20color%3Dcolor%2C%20edgecolor%3D%22black%22%2C%20linewidth%3D0.5)%0A%0A%20%20%20%20%20%20%20%20%23%20Buffer%20label%0A%20%20%20%20%20%20%20%20ax.text(6%2C%204%2C%20%22Buffer%22%2C%20ha%3D%22center%22%2C%20va%3D%22center%22%2C%20fontsize%3D9%2C%20color%3D%22gray%22)%0A%0A%20%20%20%20%20%20%20%20ax.set_yticks(r
ange(6))%0A%20%20%20%20%20%20%20%20ax.set_yticklabels(%5B%22Gen%201%22%2C%20%22Gen%202%22%2C%20%22Gen%203%22%2C%20%22Gen%204%22%2C%20%22Buffer%22%2C%20%22Trainer%22%5D)%0A%20%20%20%20%20%20%20%20ax.set_xlabel(%22Time%20(ms)%22)%0A%20%20%20%20%20%20%20%20ax.set_title(%22Async%20RL%20-%20Buffer%20Absorbs%20Variance%22)%0A%20%20%20%20%20%20%20%20ax.grid(axis%3D%22x%22%2C%20alpha%3D0.3)%0A%20%20%20%20%20%20%20%20ax.invert_yaxis()%0A%0A%20%20%20%20%20%20%20%20%23%20Legend%0A%20%20%20%20%20%20%20%20patches%20%3D%20%5B%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%2366c2a5%22%2C%20label%3D%22Short%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%238da0cb%22%2C%20label%3D%22Medium%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%23fc8d62%22%2C%20label%3D%22Long%22)%2C%0A%20%20%20%20%20%20%20%20%20%20%20%20mpatches.Patch(color%3D%22%23e78ac3%22%2C%20label%3D%22Train%22)%2C%0A%20%20%20%20%20%20%20%20%5D%0A%20%20%20%20%20%20%20%20ax.legend(handles%3Dpatches%2C%20loc%3D%22upper%20right%22%2C%20fontsize%3D8)%0A%0A%20%20%20%20%20%20%20%20plt.tight_layout()%0A%20%20%20%20%20%20%20%20return%20fig%0A%0A%20%20%20%20make_async_plot()%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20Generators%20produce%20trajectories%20as%20fast%20as%20they%20can%2C%20dropping%20them%20into%20a%0A%20%20%20%20**replay%20buffer**.%20The%20trainer%20consumes%20from%20the%20buffer%20independently.%0A%20%20%20%20The%20buffer%20absorbs%20the%20variation%20in%20generation%20latency%2C%20so%20a%20slow%20trajectory%20delays%20only%20its%20own%20generator%20instead%20of%20stalling%20the%20whole%20batch.%0A%0A%20%20%20%20%23%23%23%20Why%20Async%20Is%20Worth%20It%20(Beyond%20Stragglers)%0A%0A%20%20%20%20The%20straggler%20problem%20is%20the%20most%20obvious%20win%2C%20but%20async%20RL%20has%20deeper%0A%20%20%20%20systems%20benefits%3A%0A%0A%20%20%20%20-%20**Independent%20scaling**%20%E2%80%94%20tune%20the%20generator-to-trainer%20ratio%20without%0A%20%20%20%20%
20%20changing%20code.%20Adding%20generators%20becomes%20changing%20a%20number%20in%20your%20actor%20spawn%20call.%0A%20%20%20%20-%20**Heterogeneous%20hardware**%20%E2%80%94%20use%20inference-optimized%20GPUs%20for%20generation%0A%20%20%20%20%20%20and%20training-optimized%20GPUs%20(with%20fast%20interconnect)%20for%20gradient%20updates.%0A%20%20%20%20-%20**Fault%20isolation**%20%E2%80%94%20a%20crashed%20generator%20is%20just%20one%20fewer%20producer.%20The%0A%20%20%20%20%20%20buffer%20and%20trainer%20are%20unaffected.%20Compare%20this%20to%20sync%20RL%2C%20where%20one%0A%20%20%20%20%20%20crash%20loses%20the%20entire%20batch.%0A%0A%20%20%20%20%23%23%23%20The%20Off-Policy%20Trade-off%0A%0A%20%20%20%20There's%20a%20catch%3A%20by%20the%20time%20the%20trainer%20uses%20a%20trajectory%2C%20the%20model%0A%20%20%20%20weights%20may%20have%20changed.%20The%20trajectory%20was%20generated%20by%20an%20**old%0A%20%20%20%20policy**%20%E2%80%94%20this%20is%20called%20**off-policy**%20training.%0A%0A%20%20%20%20%7C%20Aspect%20%7C%20Sync%20(on-policy)%20%7C%20Async%20(off-policy)%20%7C%0A%20%20%20%20%7C--------%7C------------------%7C--------------------%7C%0A%20%20%20%20%7C%20Trajectory%20freshness%20%7C%20Always%20current%20weights%20%7C%20May%20be%20stale%20%7C%0A%20%20%20%20%7C%20GPU%20utilization%20%7C%20Low%20(waiting)%20%7C%20High%20(continuous)%20%7C%0A%20%20%20%20%7C%20Learning%20stability%20%7C%20More%20stable%20%7C%20Requires%20bounded%20staleness%20%7C%0A%0A%20%20%20%20In%20practice%2C%20systems%20keep%20the%20staleness%20bounded%3A%20trajectories%20more%20than%0A%20%20%20%20a%20few%20gradient%20steps%20old%20get%20evicted%20from%20the%20buffer.%0A%0A%20%20%20%20Strictly%20speaking%2C%20algorithms%20lik
e%20PPO%20and%20GRPO%20are%20meant%20to%20be%0A%20%20%20%20**on-policy**%20%E2%80%94%20they%20assume%20trajectories%20come%20from%20the%20current%20weights.%0A%20%20%20%20But%20using%20clipped%20importance%20sampling%20ratios%20(%CF%80_new%20%2F%20%CF%80_old)%20helps%0A%20%20%20%20tolerate%20mild%20staleness%20empirically%2C%20allowing%20async%20systems%20to%20trade%0A%20%20%20%20a%20small%20amount%20of%20policy%20freshness%20for%20significantly%20better%20throughput.%0A%0A%20%20%20%20%23%23%23%20Why%20This%20Maps%20to%20Actors%0A%0A%20%20%20%20Look%20at%20the%20async%20architecture%3A%20generators%2C%20a%20buffer%2C%20a%20trainer%20%E2%80%94%20each%0A%20%20%20%20is%20an%20independent%20process%20that%20communicates%20via%20messages.%20This%20**is**%20the%0A%20%20%20%20actor%20model%20from%20notebooks%201-2.%20Monarch%20gives%20us%3A%0A%0A%20%20%20%20-%20**Actors%20with%20endpoints**%20for%20defining%20remote%20generators%2C%20buffer%2C%20and%20trainer%0A%20%20%20%20-%20**Fault%20tolerance**%20(NB03)%20for%20keeping%20generators%20alive%20when%20GPUs%20fail%0A%20%20%20%20-%20**Services**%20(NB06)%20for%20managing%20pools%20of%20generators%20with%20health%20tracking%0A%20%20%20%20-%20**RDMA%20weight%20sync**%20(NB07)%20for%20moving%20hundreds%20of%20MB%20without%20blocking%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20Interactive%20Visualization%3A%20Sync%20vs%20Async%0A%0A%20%20%20%20Adjust%20the%20parameters%20to%20see%20how%20generation%20variance%20affects%20sync%20vs%20async%20RL.%0A%0A%20%20%20%20**Note%3A**%20We%20assume%20batch%20size%20%3D%20number%20of%20generators%20(one%20trajectory%20per%20generator%20per%20batch).%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20num_generators%20%3D%20mo.ui.slider(2%2C%206%2C%20value%3D4%2C%20label%3D%22Number%20of%20generators%22)%0A%20%20%20%20mean_gen_time%20%3D%20mo.ui.slider(100
%2C%20500%2C%20value%3D200%2C%20label%3D%22Mean%20generation%20time%20(ms)%22)%0A%20%20%20%20variance%20%3D%20mo.ui.slider(0.1%2C%202.0%2C%20value%3D0.8%2C%20step%3D0.1%2C%20label%3D%22Variance%20(coefficient%20of%20variation)%22)%0A%20%20%20%20train_time%20%3D%20mo.ui.slider(50%2C%20300%2C%20value%3D100%2C%20label%3D%22Training%20time%20(ms)%22)%0A%20%20%20%20weight_sync_time%20%3D%20mo.ui.slider(10%2C%20100%2C%20value%3D30%2C%20label%3D%22Weight%20sync%20time%20(ms)%22)%0A%20%20%20%20num_batches%20%3D%20mo.ui.slider(2%2C%205%2C%20value%3D3%2C%20label%3D%22Batches%20to%20simulate%22)%0A%0A%20%20%20%20mo.vstack(%5Bnum_generators%2C%20mean_gen_time%2C%20variance%2C%20train_time%2C%20weight_sync_time%2C%20num_batches%5D)%0A%20%20%20%20return%20(%0A%20%20%20%20%20%20%20%20mean_gen_time%2C%0A%20%20%20%20%20%20%20%20num_batches%2C%0A%20%20%20%20%20%20%20%20num_generators%2C%0A%20%20%20%20%20%20%20%20train_time%2C%0A%20%20%20%20%20%20%20%20variance%2C%0A%20%20%20%20%20%20%20%20weight_sync_time%2C%0A%20%20%20%20)%0A%0A%0A%40app.cell%0Adef%20_(%0A%20%20%20%20mean_gen_time%2C%0A%20%20%20%20mo%2C%0A%20%20%20%20num_batches%2C%0A%20%20%20%20num_generators%2C%0A%20%20%20%20train_time%2C%0A%20%20%20%20variance%2C%0A%20%20%20%20weight_sync_time%2C%0A)%3A%0A%20%20%20%20def%20make_interactive_comparison(n_gens_val%2C%20mean_val%2C%20var_val%2C%20train_val%2C%20sync_val%2C%20n_batch_val)%3A%0A%20%20%20%20%20%20%20%20import%20random%0A%20%20%20%20%20%20%20%20import%20matplotlib.pyplot%20as%20plt%0A%20%20%20%20%20%20%20%20import%20matplotlib.patches%20as%20mpatches%0A%20%20%20%20%20%20%20%20import%20math%0A%0A%20%20%20%20%20%20%20%20%23%20Seed%20based%20on%20slider%20values%20so%20different%20configs%20give%20different%20samples%0A%20%20%20%20%20%20%20%20random.seed(42%20%2B%20int(mean_val)%20%2B%20int(var_val%20*%20100)%20%2B%20n_gens_val%20%2B%20n_batch_val)%0A%0A%20%20%20%20%20%20%20%20%23%20Sample%20generation%20times%20from%20log-normal%20distribution%0A%20%20%20%20%20%20%20%20def
%20sample_gen_times(mean%2C%20cv%2C%20n)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20if%20cv%20%3C%200.1%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20return%20%5Bmean%5D%20*%20n%0A%20%20%20%20%20%20%20%20%20%20%20%20sigma%20%3D%20math.sqrt(math.log(1%20%2B%20cv**2))%0A%20%20%20%20%20%20%20%20%20%20%20%20mu%20%3D%20math.log(mean)%20-%20sigma**2%20%2F%202%0A%20%20%20%20%20%20%20%20%20%20%20%20return%20%5Brandom.lognormvariate(mu%2C%20sigma)%20for%20_%20in%20range(n)%5D%0A%0A%20%20%20%20%20%20%20%20%23%20Generate%20times%20for%20each%20generator%20across%20batches%0A%20%20%20%20%20%20%20%20gen_times%20%3D%20%7B%7D%0A%20%20%20%20%20%20%20%20for%20g%20in%20range(n_gens_val)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20gen_times%5Bg%5D%20%3D%20sample_gen_times(mean_val%2C%20var_val%2C%20n_batch_val)%0A%0A%20%20%20%20%20%20%20%20%23%20Color%20palette%20for%20batches%0A%20%20%20%20%20%20%20%20batch_colors%20%3D%20plt.cm.Set2(range(n_batch_val))%0A%0A%20%20%20%20%20%20%20%20%23%20%3D%3D%3D%3D%3D%20SYNC%20SIMULATION%20%3D%3D%3D%3D%3D%0A%20%20%20%20%20%20%20%20sync_bars%20%3D%20%5B%5D%0A%20%20%20%20%20%20%20%20sync_time%20%3D%200%0A%0A%20%20%20%20%20%20%20%20for%20batch%20in%20range(n_batch_val)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20batch_times%20%3D%20%5Bgen_times%5Bg%5D%5Bbatch%5D%20for%20g%20in%20range(n_gens_val)%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20max_time%20%3D%20max(batch_times)%0A%0A%20%20%20%20%20%20%20%20%20%20%20%20for%20g%20in%20range(n_gens_val)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20t%20%3D%20gen_times%5Bg%5D%5Bbatch%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20sync_bars.append((g%2C%20sync_time%2C%20t%2C%20batch_colors%5Bbatch%5D))%0A%0A%20%20%20%20%20%20%20%20%20%20%20%20train_start%20%3D%20sync_time%20%2B%20max_time%0A%20%20%20%20%20%20%20%20%20%20%20%20sync_bars.append((n_gens_val%2C%20train_start%2C%20train_val%2C%20%22crimson%22))%0A%20%20%20%20%20%20%20%20%20%20%20%20%23%20Weight%20sync%20after%20training%0A%20%20%20%20%20%20
%20%20%20%20%20%20sync_bars.append((n_gens_val%20%2B%201%2C%20train_start%20%2B%20train_val%2C%20sync_val%2C%20%22%239467bd%22))%0A%20%20%20%20%20%20%20%20%20%20%20%20sync_time%20%3D%20train_start%20%2B%20train_val%20%2B%20sync_val%0A%0A%20%20%20%20%20%20%20%20total_sync_time%20%3D%20sync_time%0A%0A%20%20%20%20%20%20%20%20%23%20%3D%3D%3D%3D%3D%20ASYNC%20SIMULATION%20%3D%3D%3D%3D%3D%0A%20%20%20%20%20%20%20%20%23%20In%20async%3A%20generators%20run%20continuously%2C%20pulling%20weights%20when%20new%20version%20available%0A%20%20%20%20%20%20%20%20%23%20Weight%20sync%20is%20generator-side%3A%20after%20each%20batch%2C%20if%20trainer%20has%20new%20weights%2C%20pull%20them%0A%20%20%20%20%20%20%20%20async_bars%20%3D%20%5B%5D%0A%0A%20%20%20%20%20%20%20%20%23%20Track%20when%20each%20training%20step%20completes%20(new%20weights%20available)%0A%20%20%20%20%20%20%20%20train_complete_times%20%3D%20%5B%5D%0A%20%20%20%20%20%20%20%20t%20%3D%200%0A%20%20%20%20%20%20%20%20while%20t%20%3C%20sum(gen_times%5B0%5D)%20*%201.5%3A%20%20%23%20Run%20trainer%20long%20enough%0A%20%20%20%20%20%20%20%20%20%20%20%20t%20%2B%3D%20train_val%0A%20%20%20%20%20%20%20%20%20%20%20%20train_complete_times.append(t)%0A%0A%20%20%20%20%20%20%20%20%23%20Each%20generator%3A%20generate%2C%20then%20sync%20if%20new%20weights%20available%0A%20%20%20%20%20%20%20%20for%20g%20in%20range(n_gens_val)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20t%20%3D%200%0A%20%20%20%20%20%20%20%20%20%20%20%20last_synced_version%20%3D%20-1%0A%20%20%20%20%20%20%20%20%20%20%20%20for%20batch%20in%20range(n_batch_val)%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Check%20if%20new%20weights%20available%20(trainer%20completed%20a%20step%20since%20last%20sync)%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20new_version%20%3D%20sum(1%20for%20tc%20in%20train_complete_times%20if%20tc%20%3C%3D%20t)%20-%201%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20if%20new_version%20%3E%20last_synced_version%20and%20new_version%20%3E%3D%200%3A%0A%
20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Pull%20new%20weights%20(sync%20time)%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20async_bars.append((g%2C%20t%2C%20sync_val%2C%20%22%239467bd%22))%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20t%20%2B%3D%20sync_val%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20last_synced_version%20%3D%20new_version%0A%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%23%20Generate%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20duration%20%3D%20gen_times%5Bg%5D%5Bbatch%5D%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20async_bars.append((g%2C%20t%2C%20duration%2C%20batch_colors%5Bbatch%5D))%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20t%20%2B%3D%20duration%0A%0A%20%20%20%20%20%20%20%20total_gen_time%20%3D%20max(%0A%20%20%20%20%20%20%20%20%20%20%20%20sum(gen_times%5Bg%5D)%20%2B%20sync_val%20*%20n_batch_val%20%20%23%20Rough%20upper%20bound%0A%20%20%20%20%20%20%20%20%20%20%20%20for%20g%20in%20range(n_gens_val)%0A%20%20%20%20%20%20%20%20)%0A%0A%20%20%20%20%20%20%20%20%23%20Trainer%20just%20runs%20continuously%0A%20%20%20%20%20%20%20%20t%20%3D%200%0A%20%20%20%20%20%20%20%20train_batch%20%3D%200%0A%20%20%20%20%20%20%20%20while%20t%20%3C%20total_gen_time%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20async_bars.append((n_gens_val%2C%20t%2C%20train_val%2C%20%22crimson%22))%0A%20%20%20%20%20%20%20%20%20%20%20%20t%20%2B%3D%20train_val%0A%20%20%20%20%20%20%20%20%20%20%20%20train_batch%20%2B%3D%201%0A%0A%20%20%20%20%20%20%20%20total_async_time%20%3D%20max(%0A%20%20%20%20%20%20%20%20%20%20%20%20max(bar%5B1%5D%20%2B%20bar%5B2%5D%20for%20bar%20in%20async_bars%20if%20bar%5B0%5D%20%3C%20n_gens_val)%2C%20%20%23%20Max%20generator%20end%0A%20%20%20%20%20%20%20%20%20%20%20%20t%20%20%23%20Trainer%20end%0A%20%20%20%20%20%20%20%20)%0A%0A%20%20%20%20%20%20%20%20%23%20%3D%3D%3D%3D%3D%20PLOTTING%20%3D%3D%3D%3D%3D%0A%20%20%20%20%20%20%20%20fig%2C%20axes%20%3D%20plt.subplots(1%2C%202%2C%20figsi
ze%3D(14%2C%204.5%20%2B%20n_gens_val%20*%200.3))%0A%20%20%20%20%20%20%20%20sync_row_labels%20%3D%20%5Bf%22Gen%20%7Bi%2B1%7D%22%20for%20i%20in%20range(n_gens_val)%5D%20%2B%20%5B%22Trainer%22%2C%20%22Wt%20Sync%22%5D%0A%20%20%20%20%20%20%20%20async_row_labels%20%3D%20%5Bf%22Gen%20%7Bi%2B1%7D%22%20for%20i%20in%20range(n_gens_val)%5D%20%2B%20%5B%22Trainer%22%5D%0A%0A%20%20%20%20%20%20%20%20%23%20Sync%20RL%20plot%0A%20%20%20%20%20%20%20%20ax%20%3D%20axes%5B0%5D%0A%20%20%20%20%20%20%20%20for%20row%2C%20start%2C%20duration%2C%20color%20in%20sync_bars%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20ax.barh(row%2C%20duration%2C%20left%3Dstart%2C%20height%3D0.6%2C%20color%3Dcolor%2C%20edgecolor%3D%22black%22%2C%20linewidth%3D0.5)%0A%20%20%20%20%20%20%20%20ax.set_yticks(range(len(sync_row_labels)))%0A%20%20%20%20%20%20%20%20ax.set_yticklabels(sync_row_labels)%0A%20%20%20%20%20%20%20%20ax.set_xlabel(%22Time%20(ms)%22)%0A%20%20%20%20%20%20%20%20ax.set_title(f%22Sync%20RL%20(total%3A%20%7Btotal_sync_time%3A.0f%7Dms)%22)%0A%20%20%20%20%20%20%20%20ax.set_xlim(0%2C%20max(total_sync_time%2C%20total_async_time)%20*%201.05)%0A%20%20%20%20%20%20%20%20ax.grid(axis%3D%22x%22%2C%20alpha%3D0.3)%0A%20%20%20%20%20%20%20%20ax.invert_yaxis()%0A%0A%20%20%20%20%20%20%20%20%23%20Async%20RL%20plot%0A%20%20%20%20%20%20%20%20ax%20%3D%20axes%5B1%5D%0A%20%20%20%20%20%20%20%20for%20row%2C%20start%2C%20duration%2C%20color%20in%20async_bars%3A%0A%20%20%20%20%20%20%20%20%20%20%20%20ax.barh(row%2C%20duration%2C%20left%3Dstart%2C%20height%3D0.6%2C%20color%3Dcolor%2C%20edgecolor%3D%22black%22%2C%20linewidth%3D0.5)%0A%20%20%20%20%20%20%20%20ax.set_yticks(range(len(async_row_labels)))%0A%20%20%20%20%20%20%20%20ax.set_yticklabels(async_row_labels)%0A%20%20%20%20%20%20%20%20ax.set_xlabel(%22Time%20(ms)%22)%0A%20%20%20%20%20%20%20%20ax.set_title(f%22Async%20RL%20(total%3A%20%7Btotal_async_time%3A.0f%7Dms)%22)%0A%20%20%20%20%20%20%20%20ax.set_xlim(0%2C%20max(total_sync_time%2C%20total_async_time)%20*%201.05)%0A%20%20%20%20
%20%20%20%20ax.grid(axis%3D%22x%22%2C%20alpha%3D0.3)%0A%20%20%20%20%20%20%20%20ax.invert_yaxis()%0A%0A%20%20%20%20%20%20%20%20%23%20Legend%0A%20%20%20%20%20%20%20%20patches%20%3D%20%5Bmpatches.Patch(color%3Dbatch_colors%5Bi%5D%2C%20label%3Df%22Batch%20%7Bi%2B1%7D%22)%20for%20i%20in%20range(n_batch_val)%5D%0A%20%20%20%20%20%20%20%20patches.append(mpatches.Patch(color%3D%22crimson%22%2C%20label%3D%22Train%22))%0A%20%20%20%20%20%20%20%20patches.append(mpatches.Patch(color%3D%22%239467bd%22%2C%20label%3D%22Wt%20Sync%22))%0A%20%20%20%20%20%20%20%20axes%5B1%5D.legend(handles%3Dpatches%2C%20loc%3D%22upper%20right%22%2C%20fontsize%3D8)%0A%0A%20%20%20%20%20%20%20%20plt.tight_layout()%0A%0A%20%20%20%20%20%20%20%20%23%20Calculate%20utilization%0A%20%20%20%20%20%20%20%20total_train_sync%20%3D%20n_batch_val%20*%20train_val%0A%20%20%20%20%20%20%20%20sync_util%20%3D%20total_train_sync%20%2F%20total_sync_time%20*%20100%0A%20%20%20%20%20%20%20%20async_util%20%3D%20min(100%2C%20train_batch%20*%20train_val%20%2F%20total_async_time%20*%20100)%0A%0A%20%20%20%20%20%20%20%20return%20fig%2C%20total_sync_time%2C%20total_async_time%2C%20sync_util%2C%20async_util%0A%0A%20%20%20%20fig%2C%20total_sync%2C%20total_async%2C%20sync_util%2C%20async_util%20%3D%20make_interactive_comparison(%0A%20%20%20%20%20%20%20%20num_generators.value%2C%20mean_gen_time.value%2C%20variance.value%2C%0A%20%20%20%20%20%20%20%20train_time.value%2C%20weight_sync_time.value%2C%20num_batches.value%0A%20%20%20%20)%0A%0A%20%20%20%20mo.vstack(%5B%0A%20%20%20%20%20%20%20%20fig%2C%0A%20%20%20%20%20%20%20%20mo.md(f%22%22%22%0A%20%20%20%20%7C%20Metric%20%7C%20Sync%20RL%20%7C%20Async%20RL%20%7C%0A%20%20%20%20%7C--------%7C---------%7C----------%7C%0A%20%20%20%20%7C%20Total%20time%20%7C%20%7Btotal_sync%3A.0f%7Dms%20%7C%20%7Btotal_async%3A.0f%7Dms%20%7C%0A%20%20%20%20%7C%20Trainer%20utilization%20%7C%20%7Bsync_util%3A.0f%7D%25%20%7C%20%7Basync_util%3A.0f%7D%25%20%7C%0A%20%20%20%20%7C%20**Speedup**%20%7C%20-%20%7C%20**%7Btotal_sync%
2Ftotal_async%3A.1f%7Dx**%20%7C%0A%0A%20%20%20%20**Try%20increasing%20the%20variance%20slider**%20to%20see%20how%20stragglers%20hurt%20sync%20RL%20while%20async%20handles%20them%20gracefully.%0A%20%20%20%20%20%20%20%20%22%22%22)%0A%20%20%20%20%5D)%0A%20%20%20%20return%0A%0A%0A%40app.cell%0Adef%20_(mo)%3A%0A%20%20%20%20mo.md(r%22%22%22%0A%20%20%20%20%23%23%20Next%20Steps%0A%0A%20%20%20%20We've%20established%20the%20task%20(Zorplex)%2C%20the%20reward%20(binary%20correctness)%2C%0A%20%20%20%20and%20the%20architecture%20(async%20RL%20with%20actors).%20Now%20we%20need%20to%20build%20it.%0A%0A%20%20%20%20The%20next%20three%20notebooks%20each%20tackle%20one%20piece%3A%0A%0A%20%20%20%20-%20**NB06%20%E2%80%94%20Services**%3A%20How%20do%20you%20manage%20a%20pool%20of%20generator%20actors%3F%0A%20%20%20%20%20%20Round-robin%20routing%2C%20health%20tracking%2C%20and%20discovery%20%E2%80%94%20so%20the%20trainer%0A%20%20%20%20%20%20can%20find%20generators%20without%20hardcoding%20addresses.%0A%20%20%20%20-%20**NB07%20%E2%80%94%20RDMA%20Weight%20Sync**%3A%20After%20each%20training%20step%2C%20generators%0A%20%20%20%20%20%20need%20fresh%20weights.%20But%20how%20do%20you%20move%20hundreds%20of%20MB%20without%20blocking%0A%20%20%20%20%20%20training%3F%20RDMA%20separates%20the%20control%20plane%20(tiny%20handle%20message)%20from%0A%20%20%20%20%20%20the%20data%20plane%20(bulk%20transfer).%0A%20%20%20%20-%20**NB08%20%E2%80%94%20Async%20RL%20E2E**%3A%20Wire%20it%20all%20together%20%E2%80%94%20generators%2C%20buffer%2C%0A%20%20%20%20%20%20trainer%2C%20weight%20sync%20%E2%80%94%20into%20a%20working%20async%20RL%20loop%20on%20Zorplex.%0A%0A%20%20%20%20---%0A%0A%20%20%20%20**Previous%3A**%20%5BNB04%20%E2%80%94%20Distributed%20Tensors%5D(.%2F04_distributed_tensors.html)%20%C2%B7%20**Next%3A**%20%5BNB06%20%E2%80%94%20Services%5D(.%2F06_services.html)%0A%20%20%20%20%22%22%22)%0A%20%20%20%20return%0A%0A%0Aif%20__name__%20%3D%3D%20%22__main__%22%3A%0A%20%20%20%20app.run()%0A