A Coding Guide to Instrumenting, Tracing, and Evaluating LLM Applications Using TruLens and OpenAI Models

In this tutorial, we build a transparent, measurable evaluation pipeline for large language model applications using TruLens. Rather than treating the LLM as a black box, we instrument each stage of a small retrieval-augmented generation (RAG) application so that inputs, intermediate steps, and outputs are captured as structured traces. We then attach feedback functions that score model behavior on groundedness, answer relevance, and context relevance. By running two application variants under the same evaluation setup, we show how TruLens supports disciplined experimentation, reproducibility, and data-driven improvement of LLM systems.

!pip -q install trulens trulens-providers-openai chromadb openai


import os, re, getpass
from dataclasses import dataclass
from typing import List, Dict, Any
import numpy as np


import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction


from openai import OpenAI


from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI as TruOpenAI
from trulens.apps.app import TruApp
from trulens.core.otel.instrument import instrument
from trulens.otel.semconv.trace import SpanAttributes
from trulens.dashboard import run_dashboard


if not os.environ.get("OPENAI_API_KEY"):
   os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ")

We prepare the Colab environment by installing the required libraries and importing the dependencies used throughout the tutorial. If OPENAI_API_KEY is not already set, we prompt for it with getpass so the key is never hardcoded or echoed in the notebook. We also import the TruLens components that later handle tracing, feedback evaluation, and dashboard visualization.

def normalize_ws(s: str) -> str:
   return re.sub(r"\s+", " ", s).strip()


RAW_DOCS = [
   {
       "doc_id": "trulens_core",
       "title": "TruLens core idea",
       "text": "TruLens is used to track and evaluate LLM applications. It can log app runs, compute feedback scores, and provide a dashboard to compare versions and investigate traces and results."
   },
   {
       "doc_id": "trulens_feedback",
       "title": "Feedback functions",
       "text": "TruLens feedback functions can score groundedness, context relevance, and answer relevance. They are configured by specifying which parts of an app record should be used as inputs."
   },
   {
       "doc_id": "trulens_rag",
       "title": "RAG workflow",
       "text": "A typical RAG system retrieves relevant chunks from a vector database and then generates an answer using those chunks as context. The quality depends on retrieval, prompt design, and generation behavior."
   },
   {
       "doc_id": "trulens_instrumentation",
       "title": "Instrumentation",
       "text": "Instrumentation adds tracing spans to your app functions (like retrieval and generation). This makes it possible to analyze which contexts were retrieved, latency, token usage, and connect feedback evaluations to specific steps."
   },
   {
       "doc_id": "vectorstores",
       "title": "Vector stores and embeddings",
       "text": "Vector stores index embeddings for text chunks, enabling semantic search. OpenAI embedding models can be used to embed chunks and queries, and Chroma can store them locally in memory for a notebook demo."
   },
   {
       "doc_id": "prompting",
       "title": "Prompting and citations",
       "text": "Prompting can encourage careful, citation-grounded answers. A stronger prompt can enforce: answer only from context, be explicit about uncertainty, and provide short citations that map to retrieved chunks."
   },
]


@dataclass
class Chunk:
   chunk_id: str
   doc_id: str
   title: str
   text: str
   meta: Dict[str, Any]


def chunk_docs(docs, chunk_size=350, overlap=80) -> List[Chunk]:
   # The window only advances when overlap is smaller than chunk_size.
   if not 0 <= overlap < chunk_size:
       raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
   chunks: List[Chunk] = []
   for d in docs:
       text = normalize_ws(d["text"])
       start = 0
       idx = 0
       while start < len(text):
           end = min(len(text), start + chunk_size)
           chunk_text = text[start:end]
           chunk_id = f'{d["doc_id"]}_c{idx}'
           chunks.append(
               Chunk(
                   chunk_id=chunk_id,
                   doc_id=d["doc_id"],
                   title=d["title"],
                   text=chunk_text,
                   meta={"doc_id": d["doc_id"], "title": d["title"], "chunk_index": idx},
               )
           )
           idx += 1
           if end == len(text):
               break  # the last window reached the end of the document
           # Step back so neighbouring chunks share `overlap` characters.
           start = end - overlap
   return chunks


CHUNKS = chunk_docs(RAW_DOCS)

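We define the raw knowledge sources and implement a clean, reusable text-chunking pipeline. We normalize whitespace in each document and split it into overlapping character windows to preserve semantic continuity during retrieval. Each sample document here is shorter than the 350-character chunk size, so it yields a single chunk, and the overlap only matters for longer texts. We attach metadata to each chunk so it can later be traced, evaluated, and cited during RAG execution.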
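To see how the sliding window behaves on longer inputs, we can compute only the character offsets it would produce. The sketch below mirrors the windowing logic of chunk_docs; the helper name chunk_spans is ours and is not part of the tutorial code.

```python
from typing import List, Tuple


def chunk_spans(text_len: int, chunk_size: int = 350, overlap: int = 80) -> List[Tuple[int, int]]:
    """Return the (start, end) character offsets of each sliding-window chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    spans = []
    start = 0
    while start < text_len:
        end = min(text_len, start + chunk_size)
        spans.append((start, end))
        if end == text_len:
            break  # the last window reached the end of the text
        start = end - overlap  # the next window re-reads the final `overlap` characters
    return spans


print(chunk_spans(800))  # [(0, 350), (270, 620), (540, 800)]
print(chunk_spans(200))  # [(0, 200)]
```

An 800-character text produces three windows, each sharing 80 characters with its neighbour, while a 200-character text fits in a single chunk.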

EMBED_MODEL = "text-embedding-3-small"
embedding_function = OpenAIEmbeddingFunction(
   api_key=os.environ.get("OPENAI_API_KEY"),
   model_name=EMBED_MODEL,
)


chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
   name="trulens_demo_kb",
   embedding_function=embedding_function,
)


ids = [c.chunk_id for c in CHUNKS]
docs = [c.text for c in CHUNKS]
metas = [c.meta for c in CHUNKS]
collection.add(ids=ids, documents=docs, metadatas=metas)


oai_client = OpenAI()


def format_context(hits):
   lines = []
   for i, h in enumerate(hits):
       meta = h["meta"]
       lines.append(
           f"[C{i}] ({meta.get('title','')}, {meta.get('doc_id','')}, chunk={meta.get('chunk_index','?')}): {h['text']}"
       )
   return "\n".join(lines)

We create the vector database using Chroma and OpenAI embeddings to enable semantic search over the chunked knowledge base. We insert all chunks into the collection and prepare the OpenAI client for downstream generation. We also define a context-formatting utility that converts retrieved chunks into a structured prompt-ready format.
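To make the prompt-ready format concrete, we can run the formatter on a couple of hand-written hits. This is a minimal sketch: format_context is repeated so the snippet runs on its own, and sample_hits is hypothetical data that assumes each hit is a dictionary with text and meta keys, which is what the function indexes.

```python
def format_context(hits):
    # Same logic as the tutorial's helper, repeated so this snippet is self-contained.
    lines = []
    for i, h in enumerate(hits):
        meta = h["meta"]
        lines.append(
            f"[C{i}] ({meta.get('title','')}, {meta.get('doc_id','')}, chunk={meta.get('chunk_index','?')}): {h['text']}"
        )
    return "\n".join(lines)


# Hypothetical hits for illustration only; real hits come from the vector store.
sample_hits = [
    {
        "id": "trulens_core_c0",
        "text": "TruLens is used to track and evaluate LLM applications.",
        "meta": {"doc_id": "trulens_core", "title": "TruLens core idea", "chunk_index": 0},
    },
    {
        "id": "prompting_c0",
        "text": "Prompting can encourage careful, citation-grounded answers.",
        "meta": {"doc_id": "prompting", "title": "Prompting and citations", "chunk_index": 0},
    },
]

context = format_context(sample_hits)
print(context)
# [C0] (TruLens core idea, trulens_core, chunk=0): TruLens is used to track and evaluate LLM applications.
# [C1] (Prompting and citations, prompting, chunk=0): Prompting can encourage careful, citation-grounded answers.
```

The [C0], [C1] tags are positional, so they identify a chunk only within one retrieval result, not across queries.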

class RAG:
   def __init__(self, *, gen_model: str, prompt_style: str = "base", k: int = 4):
       self.gen_model = gen_model
       self.prompt_style = prompt_style
       self.k = k


   @instrument(
       span_type=SpanAttributes.SpanType.RETRIEVAL,
       attributes={
           SpanAttributes.RETRIEVAL.QUERY_TEXT: "query",
           SpanAttributes.RETRIEVAL.RETRIEVED_CONTEXTS: "return",
       },
   )
   def retrieve(self, query: str) -> list:
       res = collection.query(query_texts=[query], n_results=self.k)
       hits = []
       for i in range(len(res["ids"][0])):
           hits.append(
               {
                   "id": res["ids"][0][i],
                   "text": res["documents"][0][i],
                   "meta": res["metadatas"][0][i],
               }
           )
       return hits


   @instrument(span_type=SpanAttributes.SpanType.GENERATION)
   def generate(self, query: str, hits: list) -> str:
       if not hits:
           return "I don't have enough relevant information in the knowledge base to answer."


       context = format_context(hits)


       if self.prompt_style == "strict_citations":
           system = (
               "You are a careful assistant. Use ONLY the provided context. "
               "If the context is insufficient, say so. "
               "When you make a claim, cite it with [C#] tags matching the context chunks."
           )
           user = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer (with [C#] citations):"
       else:
           system = "You are a helpful assistant."
           user = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer using the context above:"


       resp = oai_client.chat.completions.create(
           model=self.gen_model,
           messages=[
               {"role": "system", "content": system},
               {"role": "user", "content": user},
           ],
       )
       out = resp.choices[0].message.content
       return out if out else "No answer returned."


   @instrument(
       span_type=SpanAttributes.SpanType.RECORD_ROOT,
       attributes={
           SpanAttributes.RECORD_ROOT.INPUT: "query",
           SpanAttributes.RECORD_ROOT.OUTPUT: "return",
       },
   )
   def query(self, query: str) -> str:
       hits = self.retrieve(query=query)
       return self.generate(query=query, hits=hits)

We implement the core RAG application with explicit instrumentation on retrieval, generation, and the request root. We capture queries, retrieved contexts, and generated outputs as traceable spans for later evaluation. We also support multiple prompt styles, allowing us to systematically compare different prompting strategies under identical conditions.
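To build intuition for what instrumentation captures, we can imitate it with a plain Python decorator. This is a toy sketch, not how TruLens implements @instrument: the traced decorator, the TRACE list, and the ToyRAG class are hypothetical names, and the retrieval and generation bodies are stand-ins that call no external service.

```python
import functools
import time
from typing import Any, Callable, Dict, List

TRACE: List[Dict[str, Any]] = []  # hypothetical in-memory span store


def traced(span_type: str) -> Callable:
    """Toy decorator: record each call's keyword arguments, result, and latency."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "span_type": span_type,
                "name": fn.__name__,
                "inputs": kwargs,
                "output": result,
                "latency_s": time.perf_counter() - started,
            })
            return result
        return wrapper
    return decorator


class ToyRAG:
    @traced("retrieval")
    def retrieve(self, query: str) -> list:
        return [{"text": "TruLens logs app runs."}]  # stand-in for a vector search

    @traced("generation")
    def generate(self, query: str, hits: list) -> str:
        return f"Answer based on {len(hits)} chunk(s)."  # stand-in for an LLM call

    @traced("record_root")
    def query(self, query: str) -> str:
        hits = self.retrieve(query=query)
        return self.generate(query=query, hits=hits)


answer = ToyRAG().query(query="What is TruLens used for?")
for span in TRACE:
    print(span["span_type"], span["name"], span["inputs"].get("query"))
```

Each span is appended when its function returns, so the root span is recorded last even though it starts first. A real tracer also links child spans to their parent, which this sketch leaves out.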

session = TruSession()
session.reset_database()


EVAL_MODEL = "gpt-4o-mini"
provider = TruOpenAI(model_engine=EVAL_MODEL)


f_groundedness = (
   Feedback(
       provider.groundedness_measure_with_cot_reasons_consider_answerability,
       name="Groundedness",
   )
   .on_context(collect_list=True)
   .on_output()
   .on_input()
)


f_answer_relevance = (
   Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
   .on_input()
   .on_output()
)


f_context_relevance = (
   Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
   .on_input()
   .on_context(collect_list=False)
   .aggregate(np.mean)
)


GEN_MODEL = "gpt-4o-mini"


rag_base = RAG(gen_model=GEN_MODEL, prompt_style="base", k=4)
rag_strict = RAG(gen_model=GEN_MODEL, prompt_style="strict_citations", k=4)


tru_base = TruApp(
   rag_base,
   app_name="TruLens-RAG",
   app_version="v1_base_prompt",
   feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)


tru_strict = TruApp(
   rag_strict,
   app_name="TruLens-RAG",
   app_version="v2_strict_citations",
   feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)


EVAL_QUERIES = [
   "What is TruLens used for?",
   "What are the three common RAG feedbacks to evaluate?",
   "Why does instrumentation matter in RAG evaluation?",
   "What role do embeddings play in a vector store?",
   "How can prompting encourage grounded answers?",
]


with tru_base as recording:
   for q in EVAL_QUERIES:
       rag_base.query(q)


with tru_strict as recording:
   for q in EVAL_QUERIES:
       rag_strict.query(q)


leaderboard = session.get_leaderboard()
print(leaderboard)


run_dashboard(session)

We start a TruLens session, reset its database so earlier runs do not mix with this experiment, and define feedback functions for groundedness, answer relevance, and context relevance. We wrap both RAG variants as versions of the same app and run them over a shared set of evaluation queries to generate comparable records. We then surface the results through the leaderboard and the interactive dashboard, where we can compare scores across versions and read the chain-of-thought reasons behind each score.
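The context relevance feedback scores each retrieved chunk separately and then collapses those scores with a mean. The arithmetic can be sketched as follows, using made-up scores and the standard library's mean in place of np.mean; the final step assumes a per-version summary is an average over records, which is our reading rather than something the code above states.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical per-chunk context relevance scores (k=4 chunks per query).
# These numbers are invented for illustration; they are not results from the tutorial.
per_chunk_scores: Dict[str, List[float]] = {
    "What is TruLens used for?": [1.0, 0.5, 0.0, 0.5],
    "What role do embeddings play in a vector store?": [1.0, 1.0, 0.5, 0.5],
}

# Step 1: collapse the chunk-level scores of each record into one number,
# which is the role .aggregate(np.mean) plays in the feedback definition.
per_record = {q: mean(scores) for q, scores in per_chunk_scores.items()}

# Step 2 (assumption): average the record-level scores into one value per app version.
version_score = mean(list(per_record.values()))

print(per_record)     # 0.5 for the first query, 0.75 for the second
print(version_score)  # 0.625
```

A single irrelevant chunk pulls the record score down, so lowering k or tightening retrieval is one way to move this metric.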

In conclusion, we established a practical workflow for understanding and evaluating LLM behavior beyond surface-level outputs. We demonstrated how instrumentation turns every model call into an inspectable artifact and how feedback functions convert subjective judgments into consistent metrics. Through versioned runs, leaderboards, and dashboards, we can compare design choices with clarity and confidence. This tutorial lays the groundwork for building reliable, auditable, and continuously improving LLM applications in real-world settings where trust and explainability matter as much as performance.


The post A Coding Guide to Instrumenting, Tracing, and Evaluating LLM Applications Using TruLens and OpenAI Models appeared first on MarkTechPost.

