A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation

In this tutorial, we work with Microsoft’s OpenMementos dataset and explore how reasoning traces are structured through blocks and mementos in a practical, Colab-ready workflow. We stream the dataset efficiently, parse its special-token format, inspect how reasoning and summaries are organized, and measure the compression provided by the memento representation across different domains. As we move through the analysis, we also visualize dataset patterns, align the streamed format with the richer full subset, simulate inference-time compression, and prepare the data for supervised fine-tuning. In this way, we build both an intuitive and technical understanding of how OpenMementos captures long-form reasoning while preserving compact summaries that can support efficient training and inference.

!pip install -q -U datasets transformers matplotlib pandas


import re, itertools, textwrap
from collections import Counter
from typing import Dict
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset


DATASET = "microsoft/OpenMementos"


ds_stream = load_dataset(DATASET, split="train", streaming=True)
first_row = next(iter(ds_stream))
print("Columns     :", list(first_row.keys()))
print("Domain      :", first_row["domain"], "| Source:", first_row["source"])
print("Problem head:", first_row["problem"][:160].replace("\n", " "), "...")

We install the required libraries and import the core tools needed for dataset streaming, parsing, analysis, and visualization. We then connect to the Microsoft OpenMementos dataset in streaming mode to inspect it without downloading the entire dataset locally. By reading the first example, we begin understanding the dataset schema, the problem format, and the domain and source metadata attached to each reasoning trace.

BLOCK_RE   = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>",     re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)
THINK_RE   = re.compile(r"<think>(.*?)</think>",                      re.DOTALL)


def parse_memento(response: str) -> Dict:
   blocks    = [m.strip() for m in BLOCK_RE.findall(response)]
   summaries = [m.strip() for m in SUMMARY_RE.findall(response)]
   think_m   = THINK_RE.search(response)
   final_ans = response.split("</think>")[-1].strip() if "</think>" in response else ""
   return {"blocks": blocks, "summaries": summaries,
           "reasoning": (think_m.group(1) if think_m else ""),
           "final_answer": final_ans}


parsed = parse_memento(first_row["response"])
print(f"\n→ {len(parsed['blocks'])} blocks, {len(parsed['summaries'])} mementos parsed")
print("First block   :", parsed["blocks"][0][:140].replace("\n", " "), "...")
print("First memento :", parsed["summaries"][0][:140].replace("\n", " "), "...")


N_SAMPLES = 500
rows = []
for i, ex in enumerate(itertools.islice(
       load_dataset(DATASET, split="train", streaming=True), N_SAMPLES)):
   p = parse_memento(ex["response"])
   if not p["blocks"] or len(p["blocks"]) != len(p["summaries"]):
       continue
   blk_c = sum(len(b) for b in p["blocks"])
   sum_c = sum(len(s) for s in p["summaries"])
   blk_w = sum(len(b.split()) for b in p["blocks"])
   sum_w = sum(len(s.split()) for s in p["summaries"])
   rows.append(dict(domain=ex["domain"], source=ex["source"],
                    n_blocks=len(p["blocks"]),
                    block_chars=blk_c, summ_chars=sum_c,
                    block_words=blk_w, summ_words=sum_w,
                    compress_char=sum_c / max(blk_c, 1),
                    compress_word=sum_w / max(blk_w, 1)))
   if (i + 1) % 100 == 0:
       print(f"  processed {i+1}/{N_SAMPLES}")


df = pd.DataFrame(rows)
print(f"\nAnalyzed {len(df)} rows. Domain counts:")
print(df["domain"].value_counts().to_string())


per_dom = df.groupby("domain").agg(
   n=("domain", "count"),
   median_blocks=("n_blocks", "median"),
   median_block_words=("block_words", "median"),
   median_summ_words=("summ_words", "median"),
   median_char_ratio=("compress_char", "median"),
   median_word_ratio=("compress_word", "median"),
).round(3)
print("\nPer-domain medians (ratio = mementos / blocks):")
print(per_dom.to_string())

We define the regex-based parser that extracts reasoning blocks, memento summaries, the main thinking section, and the final answer from each response. We test the parser on the first streamed example and confirm that the block-summary structure is being captured correctly. We then run a streaming analysis over multiple samples to compute block counts, word counts, character counts, and compression ratios, which helps us study how the dataset behaves across examples and domains.
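The per-trace statistics collected in `df` lend themselves to a couple of quick plots. Below is a minimal sketch of that visualization step; it builds a small synthetic stand-in for `df` (made-up values) so the snippet runs on its own, but in the notebook you would plot the real DataFrame directly:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, safe in Colab/CI
import matplotlib.pyplot as plt

# Stand-in for the `df` built in the streaming analysis above
# (synthetic values, for illustration only).
df = pd.DataFrame({
    "domain":        ["math", "math", "code", "code", "science"],
    "n_blocks":      [4, 6, 3, 5, 7],
    "compress_word": [0.15, 0.22, 0.30, 0.25, 0.18],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df["n_blocks"].plot(kind="hist", bins=5, ax=axes[0])
axes[0].set_title("Blocks per trace")
df.boxplot(column="compress_word", by="domain", ax=axes[1])
axes[1].set_title("Word-level compression by domain")
fig.suptitle("")  # clear the automatic "Boxplot grouped by ..." title
fig.tight_layout()
fig.savefig("memento_stats.png")
```

The histogram shows how many reasoning blocks a typical trace contains, while the per-domain boxplot makes it easy to spot whether, say, math traces compress more aggressively than code traces.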


We visualize the dataset’s structural patterns by plotting block counts, compression ratios, and the relationship between block size and memento size. We compare these distributions across domains to see how reasoning organization differs between math, code, and science examples. We also stream one example from the full subset and inspect its additional sentence-level and block-alignment fields, which helps us understand the richer internal annotation pipeline behind the dataset.

def compress_trace(response: str, keep_last_k: int = 1) -> str:
   blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
   if not blocks or len(blocks) != len(summaries):
       return response
   out, n = ["<think>"], len(blocks)
   for i, (b, s) in enumerate(zip(blocks, summaries)):
       if i >= n - keep_last_k:
           out.append(f"<|block_start|>{b}<|block_end|>")
           out.append(f"<|summary_start|>{s}<|summary_end|>")
       else:
           out.append(f"<|summary_start|>{s}<|summary_end|>")
   out.append("</think>")
   out.append(response.split("</think>")[-1])
   return "\n".join(out)


orig, comp = first_row["response"], compress_trace(first_row["response"], 1)
print(f"\nOriginal   : {len(orig):>8,} chars")
print(f"Compressed : {len(comp):>8,} chars ({len(comp)/len(orig)*100:.1f}% of original)")


from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
MEM_TOKENS = ["<|block_start|>", "<|block_end|>",
             "<|summary_start|>", "<|summary_end|>",
             "<think>", "</think>"]
tok.add_special_tokens({"additional_special_tokens": MEM_TOKENS})


def tlen(s): return len(tok(s, add_special_tokens=False).input_ids)


blk_tok = sum(tlen(b) for b in parsed["blocks"])
sum_tok = sum(tlen(s) for s in parsed["summaries"])
print(f"\nTrace-level token compression for this example:")
print(f"  block tokens    = {blk_tok}")
print(f"  memento tokens  = {sum_tok}")
print(f"  compression     = {blk_tok / max(sum_tok,1):.2f}×  (paper reports ~6×)")


def to_chat(ex):
   return {"messages": [
       {"role": "user",      "content": ex["problem"]},
       {"role": "assistant", "content": ex["response"]},
   ]}
chat_stream = load_dataset(DATASET, split="train", streaming=True).map(to_chat)
chat_ex = next(iter(chat_stream))
print("\nSFT chat example (truncated):")
for m in chat_ex["messages"]:
   print(f"  [{m['role']:9s}] {m['content'][:130].replace(chr(10),' ')}...")

We simulate inference-time compression by rewriting a reasoning trace so that older blocks are replaced by their mementos while the latest blocks remain intact. We then compare the original and compressed trace lengths to see how much context can be reduced in practice. After that, we integrate a tokenizer, add special memento tokens, measure token-level compression, and convert the dataset to an SFT-style chat format suitable for training workflows.
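To see how the savings scale with `keep_last_k`, here is a self-contained sweep over a synthetic three-block trace (the `compress_trace` logic is restated compactly so the snippet runs on its own; the trace itself is made up for illustration):

```python
import re

BLOCK_RE   = re.compile(r"<\|block_start\|>(.*?)<\|block_end\|>", re.DOTALL)
SUMMARY_RE = re.compile(r"<\|summary_start\|>(.*?)<\|summary_end\|>", re.DOTALL)

def compress_trace(response: str, keep_last_k: int = 1) -> str:
    # Keep the last k full blocks; older blocks collapse to their mementos.
    blocks, summaries = BLOCK_RE.findall(response), SUMMARY_RE.findall(response)
    if not blocks or len(blocks) != len(summaries):
        return response
    out, n = ["<think>"], len(blocks)
    for i, (b, s) in enumerate(zip(blocks, summaries)):
        if i >= n - keep_last_k:
            out.append(f"<|block_start|>{b}<|block_end|>")
        out.append(f"<|summary_start|>{s}<|summary_end|>")
    out.append("</think>")
    out.append(response.split("</think>")[-1])
    return "\n".join(out)

# Synthetic trace: three long blocks, each with a short memento.
trace = "<think>\n" + "\n".join(
    f"<|block_start|>{'step ' * 40}{i}<|block_end|>\n"
    f"<|summary_start|>memento {i}<|summary_end|>" for i in range(3)
) + "\n</think>\nanswer: 42"

for k in (0, 1, 2, 3):
    c = compress_trace(trace, keep_last_k=k)
    print(f"keep_last_k={k}: {len(c):4d} chars ({len(c)/len(trace):.0%} of original)")
```

With `keep_last_k=0` every block collapses to its memento (maximum compression); with `keep_last_k=3` all blocks survive and the trace length is essentially unchanged, so the compressed length grows monotonically with `k`.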

def render_trace(response: str, width: int = 220) -> None:
   p = parse_memento(response)
   print("=" * 72)
   print(f"{len(p['blocks'])} blocks · {len(p['summaries'])} mementos")
   print("=" * 72)
   for i, (b, s) in enumerate(zip(p["blocks"], p["summaries"]), 1):
       ratio = len(s) / max(len(b), 1) * 100
       print(f"\n▶ BLOCK {i}  ({len(b):,} chars)")
       print(textwrap.indent(textwrap.shorten(b.replace("\n", " "), width=width), "  "))
       print(f"◀ MEMENTO {i}  ({len(s):,} chars · {ratio:.1f}% of block)")
       print(textwrap.indent(textwrap.shorten(s.replace("\n", " "), width=width), "  "))
   if p["final_answer"]:
       print("\n★ FINAL ANSWER")
       print(textwrap.indent(textwrap.shorten(p["final_answer"].replace("\n"," "),
                                              width=width*2), "  "))


render_trace(first_row["response"])

We build a pretty-printer that renders a single reasoning trace in a much more readable block-by-block format. We display each block alongside its paired memento and calculate the summary’s size relative to the original block, making the compression effect easy to inspect manually. By running this renderer on the first example, we create a clean qualitative view of how OpenMementos organizes reasoning and preserves essential information through summaries.
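The renderer leans on two standard-library helpers, `textwrap.shorten` and `textwrap.indent`, whose behavior is worth seeing in isolation (the sample sentence is made up):

```python
import textwrap

s = ("We expand the series, cancel the dominant terms, "
     "and verify the limit numerically.")

# shorten collapses whitespace, truncates at a word boundary,
# and appends a " [...]" placeholder so the result fits in `width`.
short = textwrap.shorten(s, width=40)
print(short)

# indent prefixes every line, which is how render_trace nests
# block and memento text under their headers.
print(textwrap.indent(short, "  "))
```

Because `shorten` never splits mid-word and always accounts for the placeholder, every block and memento preview in `render_trace` stays within the requested width.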

In conclusion, we gained a clear view of how OpenMementos represents reasoning as a sequence of detailed blocks paired with concise mementos, and we saw why this structure is useful for context compression. We parsed real examples, computed domain-level statistics, compared block and summary lengths, and observed how compressed traces can reduce token usage while still retaining key information. We also aligned the streamed dataset format with the full subset, converted the data to an SFT-ready chat structure, and built tools for inspecting traces more clearly. Through this end-to-end workflow, we come away understanding the dataset itself and seeing how it can serve as a practical foundation for studying reasoning traces, memory-style summarization, and efficient long-context model behavior.



The post A Coding Implementation on Microsoft’s OpenMementos with Trace Structure Analysis, Context Compression, and Fine-Tuning Data Preparation appeared first on MarkTechPost.


