
TinyAgents - Sandboxed, Standalone Fragments


A quick prefix before I dive in. Unfortunately I cannot find the exact post, but I did find some tangential names for what I plan to describe. For anyone unfamiliar, the concept is similar to a Tweet: a signal rather than a full message. I have seen several developers blog about "artifacts" or "semaphores", and those who "curate digital gardens" call them seeds. While the names are different, the concept is the same. My last post used the term "artifact": a small unit of work, related to an overall system, that communicates state or intent rather than a full message.

Not only is this relevant for how I form posts, but it is also a term that has come up in the GenAI space.

Artifacts allow you to turn ideas into shareable apps, tools, or content—build tools, visualizations, and experiences by simply describing what you need. Claude can share substantial, standalone content with you in a dedicated window separate from the main conversation. This makes it easy to work with significant pieces of content that you may want to modify, build upon, or reference later. — Claude Support (Unironic emdash)

Serverless Analogy

In the last couple of years, the enterprise world seemed hell-bent on micro-service this, micro-service that. In some cases, I think micro-services really shine. However, it is stupidly difficult to represent something stateful as something stateless. LLMs are stateless. Maintaining a conversation means reprocessing the entire conversation up to the new message. Token caches aside, they are this way because the model depends on the content before to predict the content after. If the timeline of state is so short, a single inference runs and the next could come immediately after, or weeks later, why hold onto state in memory?
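To make that concrete, here is a rough sketch of why "memory" in a chat is really just the client re-sending everything; the endpoint and model name are placeholders, not anything from this post:

```typescript
// Minimal sketch: the "conversation" is just an array the client keeps growing.
type Msg = { role: "system" | "user" | "assistant"; content: string };

const history: Msg[] = [];

async function nextTurn(userText: string): Promise<string> {
  history.push({ role: "user", content: userText });

  // The entire history goes over the wire on every single turn.
  const resp = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "placeholder-model", messages: history }),
  });
  const data = await resp.json();
  const assistant: string = data.choices[0].message.content;

  history.push({ role: "assistant", content: assistant });
  return assistant;
}
```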

I remember delving into serverless because I thought the idea was really great for Homelab applications. If you, hypothetically, were running a Raspberry Pi with a fixed amount of memory, why would you want to run a container 24/7 if you only call it 3 times a day? Well, to be quite frank, because cold starts suck and compute tends to be more expensive than idle memory. Micro-VM systems like Firecracker also tend to be extremely complex, consume their own resources, and carry their own overhead. And if you already own the hardware, what does it save you if a VM sits there occupying, call it, 0.5% of a vCPU?

Despite my love for the concept of serverless, it does not have a large place in the Homelab world. Just because you can run, say, Knative doesn't mean you should. But by all means, run what you want in your homelab. I won't judge, although I may be jealous of your setup.

Stateful or Stateless?

Live by the sword, die by the sword.

A lot of applications cannot be represented as stateless, nor in a lot of cases should they be. Let's say you have a client's Enterprise Resource Planning (ERP) system. Does it make sense to allocate, reserve, and free compute every time a call is made? Probably not. Contrarily, a blog like this, with no comment section and no backend code, is perfect for serverless. Even if it were backed by serverless, the resources needed to serve a couple of static files would be near nothing.

When I see these AI wrappers, I think to myself, "man, this is the ideal situation for serverless." You have a defined trigger (a message, a CRON job, a function call, etc.), and you have a roughly structured unit of work. After that unit of work completes, some of that data may persist. Depending on the use case, the state will live elsewhere.

We are nearing the end of the rant, and if you, the reader, cannot tell where I am going, don't worry: we are on the cusp.

Smallest Unit of Agent

Sidebar: SAP has an AI assistant called Joule, named after the English physicist James Prescott Joule, whose work on energy gave us the unit of work that bears his name. Fitting, given that this section is about single units of work.

Let's say you are a nerd like me. Your morning or early afternoon consists of a check on Hacker News or Lobste.rs. Let's assume (for no reason whatsoever) that you want to automate this process (I don't, but bear with me). What would that look like from a code perspective? More than likely, it would involve hitting that endpoint (let's skip over the JSON option for now), scraping the links for the day's top stories, visiting each link, and summarizing the contents. It would then take all of this information, combine it, and provide a report.

This is low-hanging fruit for an LLM. Assuming you provide it a link, a process, and an expected output, this is relatively straightforward. For context, here is the defined process:

```mermaid
stateDiagram-v2
    direction LR
    userprompt : User prompt (Optional)
    linkParser : Link Parsing Script
    ingestScraper : Aggregate and Ingest Links
    summarizer: Summarizer Agent
    aggregator: Aggregate Markdown
    [*] --> userprompt
    userprompt --> linkParser
    linkParser --> ingestScraper
    ingestScraper --> summarizer
    summarizer --> aggregator
    aggregator --> [*]
```

Now let's say you allow the agent to decide.

```mermaid
stateDiagram-v2
    direction LR
    userprompt : User prompt (Optional)
    summarizeTool: Summarizer Agent
    aggregator: Aggregate Summaries
    [*] --> userprompt
    userprompt --> decideToolUse
    decideToolUse --> scrapeTool
    scrapeTool --> decideToolUse
    decideToolUse --> summarizeTool
    summarizeTool --> decideToolUse
    summarizeTool --> aggregator
    aggregator --> [*]
```

Let's make some assumptions:

- The model misunderstands a tool call (wrong tool, malformed arguments) about 2% of the time.
- The model "understands" but still routes incorrectly, or deviates from a previously correct route, about 5% of the time.

Each unit of work would be one article: decide the tooling, scrape the link, and summarize it.

So if you visit 10 articles, your best-case scenario is 100%. That is an easy one. However, the worst-case scenario is slightly different. The two rates are mutually exclusive and describe two distinct ways a tool call can fail: one, because the model misunderstood, and two, because it "understood" but still chose the wrong route, or chose the right route once and then chose a different one.

$$P(\text{Failure}) = 0.02 + 0.05 = 0.07$$
$$P(\text{Success}) = 1 - 0.07 = 0.93 \text{, or } 93\%$$
$$\text{Total success rate} = (0.93)^{10} \approx 0.48$$

So in the worst-case scenario, you are 48% accurate. Now, let's assume something different. Let's say the LLM does not decide to use a tool explicitly, the number of articles is fixed (or at most a maximum), and the process is deterministic. Your success rate, once the tool is written, is 100%, assuming each article is reachable.
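If you want to sanity-check the arithmetic, it is a few lines:

```typescript
// Two mutually exclusive failure modes per tool decision (numbers from above).
const pFailure = 0.02 + 0.05;          // misunderstood + wrong route
const pSuccessPerStep = 1 - pFailure;  // 0.93
const articles = 10;

console.log((pSuccessPerStep ** articles).toFixed(2)); // "0.48"
```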

But How?

I referenced this article in another post, but I will include it again: Executable Code Actions Elicit Better LLM Agents. It is a great paper, but I want to take it one step further, not from the research side but from the practical implementation perspective. If we have a predefined process, and we are software engineers (or, worst-case scenario, testers), we can simply have the LLM agent attempt to write something, then build the automation around it. This concept is artifacts. And while it seems very intuitive, I don't know why more companies have not started doing this. I know several of the vibe-coding apps allow you to host, however.

The idea here is simple and builds on the prior post about a small execution environment. Before we can build some of the other tooling, let's refactor the code. Instead of using it as a server, we will reformulate it as a tool.

```typescript
import pyodideModule from "npm:pyodide/pyodide.js";

const pyodide = await pyodideModule.loadPyodide();

export async function runCode(
  code: string,
  enableLocalFS: boolean = false,
  modules?: string[],
) {
  await pyodide.loadPackage("micropip");
  const mpip = pyodide.pyimport("micropip");
  await mpip.install(modules || []);

  // Optionally expose a host directory to the sandboxed interpreter.
  if (enableLocalFS) pyodide.mountNodeFS(".", "data_path");

  // Redirect stdout so anything the snippet prints can be captured and returned.
  await pyodide.runPythonAsync(`
    import sys
    import io
    sys.stdout = io.StringIO()
  `);

  const returned = await pyodide
    .runPythonAsync(code)
    .then((out) => ({ out }))
    .catch((err) => ({ err: err.message }));
  const stdout = await pyodide.runPythonAsync("sys.stdout.getvalue()");

  return {
    stdout,
    returned,
  };
}
```

There is only one small change from the other post: it allows you to mount a local path if you choose to. I actually quite like this idea. It turns out that if you have the LLM try to go up the file tree, it will only find the pyodide directory, the directory you mount, and the code directory. So at a minimum, it does have somewhat of a small sandbox.
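To be explicit about how the refactored tool gets called, here is a hypothetical smoke test (not part of the agent yet):

```typescript
// Run a snippet with the local mount enabled and no extra micropip modules.
const result = await runCode(
  `import os
print(os.listdir("."))`,
  true, // enableLocalFS
  [],   // modules
);

console.log(result.stdout);   // whatever the snippet printed
console.log(result.returned); // { out } on success, { err } on failure
```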

For the sake of this post, I will assume you bring your own LLM that is OpenAI-endpoint compatible, local or not. I will be using gpt-oss-20b, although its Harmony template sometimes has issues. Feel free to try any model. Assume, for the sake of this, your function has the signature:

```typescript
async function callLLM(messages: ChatCompletionMessageParam[]) {}
```

I highly recommend trimming any thinking tokens if your model returns them in the response, to keep both token usage and distractions low.
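If you need a starting point, a minimal callLLM might look something like this. The endpoint, key, and model name are placeholders for whatever you actually run, and the regex assumes thinking tokens arrive wrapped in `<think>` tags:

```typescript
import OpenAI from "npm:openai";
import type { ChatCompletionMessageParam } from "npm:openai/resources/chat/completions";

// Placeholder endpoint and key for a local OpenAI-compatible server.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "not-needed-locally",
});

async function callLLM(messages: ChatCompletionMessageParam[]) {
  const completion = await client.chat.completions.create({
    model: "gpt-oss-20b",
    messages,
  });

  const msg = completion.choices[0].message;
  // Strip reasoning blocks so they do not pile up in the history.
  msg.content = (msg.content ?? "")
    .replace(/<think>[\s\S]*?<\/think>/g, "")
    .trim();

  return msg;
}
```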

Function Calls, Structured Output, YAML, so What?

In the case of code execution, and this is speculation, outputting code as a function call does not seem very intuitive to me. It is neither intuitive for me as a developer, nor do I think it is a great design for the agent. LLMs typically surround their code in Markdown fenced code blocks. These sometimes have a language specifier; sometimes the LLM drops it.

My callout would be: the LLM will often return multiple code blocks in a single response, for example a planning block alongside the actual solution. That is actually not a big issue; we would want the plan in the combined output anyway. So my brain leans towards the combine-all-fenced-code-blocks approach. This is super easy to implement using a regular expression:

```typescript
function concatMarkdown(message: string) {
  const regex = /```(?:[a-zA-Z0-9]*)\n([\s\S]*?)```/g;

  let match: RegExpExecArray | null;
  let codeOutput = "";

  while ((match = regex.exec(message)) !== null) {
    codeOutput += `${match[1]}\n`;
  }

  return codeOutput.trim();
}
```
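As a quick illustration (this example message is my own, not from the post), two fenced blocks in one response collapse into a single combined snippet:

```typescript
// Example input: an assistant reply containing a plan block and a code block.
// The fence string is built dynamically only so this sample nests cleanly here.
const fence = "```";
const reply = [
  "Here is my plan:",
  `${fence}python`,
  'plan = "scrape the links, then summarize each one"',
  fence,
  "And the code:",
  `${fence}python`,
  "print(plan)",
  fence,
].join("\n");

console.log(concatMarkdown(reply));
// plan = "scrape the links, then summarize each one"
//
// print(plan)
```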

We just iterate over all the fenced code blocks, grab each match inside, and aggregate with newline characters in between. We trim the edges because, why not? The last thing: I would recommend a small helper function to format messages, just to keep your code less verbose.

```typescript
// I don't like { role: "assistant", content: responseToUser } everywhere.
function formatMessage(role: "system" | "assistant" | "user", content: string) {
  return { role, content };
}
```

Agent Loop

The agent loop is super simple. Here are the steps as I have defined them, but feel free to alter them any way you want. We all have our opinions on style.

1. Call the LLM with the running message history.
2. Extract any fenced code blocks from the response and concatenate them.
3. If there is no code and the agent has already run code before, treat the response as the final answer and stop.
4. If there is no code and nothing has run yet, call the LLM again.
5. Otherwise, execute the code, feed the output back into the history as a user message, and repeat up to a maximum number of turns.

Super simple. A lot of posts I see on the internet do not share prompts. I will, because I am all about learning rather than gatekeeping.

System Prompt:

```
You are an expert programmer. You will be given a task, and your job is to complete it effectively
and correctly.

# Guidance

- Multi-step problems benefit from planning. To plan or think, use a multi-line string in Python wrapped in
a Markdown code block.
- Reminder: use Python code snippets to call tools! Assume you have any
dependencies referenced by the user already installed.
- Follow output and rules guidelines exactly.

# Rules
- Variables defined at the top level of previous code snippets can be referenced
in your code.
- Do not include information about installing or running. This will be handled
automatically.
- Avoid speculating the output. The code output will be provided to you afterwards.
- You must write code once. Do not respond directly with the answer.

# Output Format
- Python code snippet that provides the solution to the task, or a step towards the solution.
Any output you want to extract from the code should be printed to the console.
Code MUST be output in a fenced code block.
- Text to be shown directly to the user, if you want to ask for more information or
provide the final answer. Do NOT use fenced code blocks in this case.
```

Code Result Prompt:

```
Code output:
{{OUTPUT}}
User's prompt:
{{PROMPT}}

Reflect on the code written and the output. You must make a decision based on the code you wrote and the output.

If the output matches expectations, then respond with output without using fenced code blocks.
If the output is not ready, iterate until you have completed the task.
```

The test prompt leverages the file system and mounted directory I referenced earlier. It is super simple, but it gives you an idea of that functionality first:

```
Find the hidden text file and tell me the message!
```

The file contains the content: "You found me!"
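If you want to reproduce this, the hidden file has to exist in the mounted host folder before the run. The file name here is my own placeholder:

```typescript
// Seed the mounted directory with a hidden file for the agent to find.
await Deno.mkdir("data_path", { recursive: true });
await Deno.writeTextFile("data_path/.hidden_message.txt", "You found me!");
```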

Loop Time

```typescript
async function runAgent() {
  let hasUsedTool = false;
  const MAX_TURNS = 5;
  const history: ChatCompletionMessageParam[] = [
    formatMessage("system", SYSTEM_PROMPT),
    formatMessage("user", TEST_PROMPT),
  ];

  const logCode = (label: string, code: string) => {
    if (console.groupCollapsed) {
      console.groupCollapsed(`🔧 ${label}`);
      console.log(code);
      console.groupEnd();
    } else {
      console.log(`--- ${label} ---\n${code}\n---\n`);
    }
  };

  for (let turnIdx = 0; turnIdx < MAX_TURNS; turnIdx++) {
    const out = await callLLM(history);
    const md = concatMarkdown(out.content!);

    if (md.length < 1 && hasUsedTool) {
      // LLM wants to output something to us.
      history.push(formatMessage("assistant", out.content!));
      break;
    } else if (md.length < 1 && !hasUsedTool) {
      // This will happen if the LLM does not output Python on the
      // first call properly.
      continue;
    }

    hasUsedTool = true;

    history.push(out);
    logCode("Wrote code", md);

    // LLM wants to execute code.
    const res = await runCode(md, true, []);

    const execOutput = STDOUT_OR_RETURN_PROMPT.replace(
      "{{OUTPUT}}",
      res.stdout || res.returned,
    ).replace("{{PROMPT}}", TEST_PROMPT); // fill in the template's user prompt too

    history.push(formatMessage("user", execOutput));
  }

  console.log(`Output:\n${history[history.length - 1]!.content}`);
}
```
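Kick it off with a plain Deno entrypoint:

```typescript
// Runs only when this module is executed directly, e.g. `deno run -A agent.ts`.
if (import.meta.main) {
  await runAgent();
}
```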

Run results:

```
Wrote code
  import os

  # Search for hidden files in the current directory and its subdirectories
  hidden_files = []
  for root, dirs, files in os.walk('.'):
      for file in files:
          if file.startswith('.'):
              hidden_files.append(os.path.join(root, file))

  # Print the found hidden files
  print("Hidden files found:")
  for file in hidden_files:
      print(file)

  # Check if any hidden file contains a message
  for file_path in hidden_files:
      try:
          with open(file_path, 'r', encoding='utf-8') as file:
              content = file.read()
              if content.strip():
                  print(f"\nMessage found in {file_path}:")
                  print(content)
                  break
      except Exception as e:
          continue
Loading micropip
Loaded micropip
Output:
You found me!
```

Introducing a Tad More Complexity

Let's modify the test prompt with something a bit more complicated.

```
Goal: Summarize the top 5 articles today from HackerNews.

Steps:
1. Pull the top HackerNews articles for today using their JSON API.
2. Get the HTML for each site using requests.
3. Markdownify each result
4. Calling 'await summarize(result)' for each result. Do not use a placeholder or implement. I will provide this code.

Finally, write the concatenated results as a single Markdown file.
```

In order to allow the Python code to invoke an LLM, I added a module.

```typescript
async function summarize(toSummarize: string) {
  const messages = [
    formatMessage("system", "You are an expert summarizer."),
    formatMessage(
      "user",
      `Summarize the following, only output the summary: ${toSummarize}`,
    ),
  ];

  const res = await callLLM(messages);
  return res.content;
}

pyodide.registerJsModule("llm", { summarize });
```

We can then include the import by default:

```typescript
await pyodide.runPythonAsync(`from llm import summarize`);
```
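As a quick sanity check (my own, not from the run above), the registered module can be exercised through runCode itself; runPythonAsync supports top-level await, so the Python side can await the JavaScript summarize directly:

```typescript
// Assumes the default `from llm import summarize` import is already in place.
const check = await runCode(
  `text = "Pyodide lets Python call back into registered JavaScript modules."
result = await summarize(text)
print(result)`,
  false,
  [],
);

console.log(check.stdout);
```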

The run looks good with gpt-oss-20b; it hits it on the first shot. Qwen3-30B-A3B had some issues where it was convinced it had to write the summarize function on its own each time. Honestly, the headline of one of these articles is nutty. "Running Doom on Pinebuds Pro Earbuds" is a wild title. Reminds me of "Running Doom on 100 pounds of moldy potatoes".

wtf

The LLM-generated Python:

```python
import asyncio
import datetime
from typing import List

import requests
from markdownify import MarkdownConverter


def html_to_md(html: str) -> str:
    return MarkdownConverter().convert(html)


async def main() -> None:
    top_stories_url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    try:
        ids_resp = requests.get(top_stories_url, timeout=10)
        ids_resp.raise_for_status()
        top_ids: List[int] = ids_resp.json()
    except Exception as e:
        print(f"Failed to fetch top stories IDs: {e}")
        return

    today_utc = datetime.datetime.utcnow().date()
    summaries: List[str] = []
    user_agent = "Mozilla/5.0 (compatible; HNSummarizer/1.0)"
    headers = {"User-Agent": user_agent}

    for story_id in top_ids:
        if len(summaries) >= 5:
            break

        item_url = f"https://hacker-news.firebaseio.com/v0/item/{story_id}.json"
        try:
            item_resp = requests.get(item_url, timeout=10)
            item_resp.raise_for_status()
            item_data = item_resp.json()
        except Exception as e:
            print(f"Failed to fetch story {story_id}: {e}")
            continue

        if not item_data.get("url") or "time" not in item_data:
            continue

        story_time = datetime.datetime.utcfromtimestamp(item_data["time"]).date()
        if story_time != today_utc:
            continue

        url = item_data["url"]
        title = item_data.get("title", "Untitled")

        try:
            page_resp = requests.get(url, headers=headers, timeout=10)
            page_resp.raise_for_status()
            html_content = page_resp.text
        except Exception as e:
            print(f"Failed to fetch URL {url}: {e}")
            continue

        md_content = html_to_md(html_content)

        summary_text = await summarize(md_content)  # type: ignore
        if not isinstance(summary_text, str):
            try:
                import json

                summary_text = json.dumps(summary_text)
            except Exception:
                summary_text = str(summary_text)

        summaries.append(f"### {title}\n\n{summary_text}")

    if not summaries:
        print("No HackerNews stories found for today.")
        return

    final_markdown = "\n\n---\n\n".join(summaries)
    output_file = "hn_top5.md"
    try:
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(final_markdown)
        print(f"Summaries written to {output_file}")
    except Exception as e:
        print(f"Failed to write output file: {e}")


if __name__ == "__main__":
    asyncio.run(main())
```

The rendered Markdown output:

A macOS app that blurs your screen when you slouch

Posturr is a lightweight macOS utility that monitors your posture in real‑time using the Mac’s camera and Apple’s Vision framework. When it detects you slouching, it progressively blurs all open windows to nudge you back into good form. The app runs as a background menu‑bar icon, offers adjustable sensitivity, calibration, and multi‑display support, and keeps all video processing local with no cloud tracking. It can be installed from the released zip or built from source on macOS 13+.


Using PostgreSQL as a Dead Letter Queue for Event-Driven Systems

The article describes how a Wayfair project used CloudSQL PostgreSQL as a Dead Letter Queue (DLQ) for an event‑driven pipeline that ingests Kafka events, hydrates them via downstream APIs, and stores enriched data in PostgreSQL. When processing failures occurred—API outages, consumer crashes, malformed payloads—the team redirected failed events to a dedicated dlq_events table instead of dropping or requeueing them in Kafka. The table stored raw JSONB payloads, error details, status (PENDING/SUCCEEDED), retry count and delay timestamps, with indexes on status, retry time, event type, and creation date to support efficient querying and retries.

A retry scheduler, protected by ShedLock to run only once across multiple instances, periodically selects eligible PENDING rows using a SELECT … FOR UPDATE SKIP LOCKED query. It processes up to 50 events every six hours, increments retry counters, updates status on success, and respects a maximum retry limit of 240 attempts. This approach keeps failures visible for debugging via simple SQL queries, avoids retry storms, and allows automatic reprocessing when downstream services recover.

Overall, the solution leverages Kafka’s strengths in high‑throughput ingestion while using PostgreSQL’s durability and queryability to handle failures predictably, reducing operational stress and making failure handling a routine, observable part of the system.


Doom has been ported to an earbud

DOOMBUDS – Running Doom on Pinebuds Pro Earbuds

  • What it is: A remote‑play version of the classic 1993 game DOOM that runs entirely inside the firmware of Pinebuds Pro earbuds (the only earbuds with open‑source firmware). Players queue online and play via a browser interface, with gameplay streamed back as low‑latency MJPEG over Twitch once enough users are in the queue.

  • Architecture

    1. Doom port compiled for the earbuds’ Cortex‑M4F CPU.
    2. Serial server bridges the UART connection to a web server and transcodes the video stream to Twitch.
    3. Web server serves assets, manages the player queue, forwards key presses, and delivers the MJPEG feed.
    4. Static front‑end (HTML/JS) that displays the game screen, handles controls, and shows the queue status.
  • Key technical challenges & solutions
    Serial bandwidth: UART gives ~2.4 Mbps → raw framebuffer (~96 kB per frame) would only allow ~3 fps.
    Compression: Use JPEG encoding (via an embedded encoder) to send each frame as a JPEG; average size ~11–13.5 kB, yielding theoretical 22‑27 fps.
    CPU: Stock firmware runs at 100 MHz; boosted to 300 MHz. The Cortex‑M4F can run DOOM but is limited by JPEG encoding (~18 fps).
    RAM: Base RAM 768 KB → ~992 KB after disabling the coprocessor. Extensive optimizations (pre‑computed tables, const‑variables in flash, no caching) reduce Doom’s memory footprint from 4 MB to fit within the limit.
    Flash storage: The full DOOM shareware WAD is 4.2 MB but the earbuds have only 4 MB of flash. Using Fragglet’s “Squashware” trimmed‑down 1.7 MB WAD makes it possible.

  • Usage: Anyone can clone the repos –
    DOOMBuds (firmware) and DOOMBUDS-JS (web client).
    The project also provides a queue interface, key‑binding instructions, and Twitch‑based streaming for bandwidth savings.

  • Extras: A forthcoming article/video will dive deeper into the implementation details. Links to the developer’s LinkedIn are included for hiring interest.


FAA institutes nationwide drone no-fly zones around ICE operations

The FAA has issued NOTAM FDC 6/4375, creating a nationwide moving drone‑no‑fly zone around all Immigration & Customs Enforcement (ICE) and other Department of Homeland Security (DHS) mobile assets—including vehicle convoys and escorts. Drones are prohibited within 3,000 ft laterally and 1,000 ft vertically of these facilities or vehicles at all times; the restriction moves with the assets and is not tied to fixed coordinates or time windows. The zone is classified as “national defense airspace” under federal security statutes, and violators may face criminal prosecution, civil penalties, administrative action, or revocation of FAA privileges; drones deemed a credible threat can be seized or destroyed. Limited exceptions exist for operations directly supporting national defense, homeland security, law‑enforcement, firefighting, search‑and‑rescue, or disaster response missions that receive prior coordination with DHS or the FAA. The move has drawn criticism from drone operators and civil‑liberties groups who note the lack of real‑time visibility into where ICE convoys operate, raising concerns about inadvertent violations.

Continuation

So how does this line up with the introduction? The idea I have is that any LLM-generated process (in this example: a block of code, several dependencies, and a mounted folder that produces a file) exists as an artifact. And instead of making your agent grapple with tools, reasoning chains, and complex planning and execution, have it just write some Python. That Python artifact can then be re-used, scheduled, deployed (with great caution), or simply discarded and written again.
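To make "scheduled" concrete, here is a rough sketch of replaying a saved artifact; the artifact path and package list are placeholders, and Deno.cron may require an unstable flag depending on your Deno version:

```typescript
// Assume the validated code block from the agent run was saved to this path.
const artifactPath = "artifacts/hn_top5.py";

// Replay the artifact on a schedule without involving the LLM at all.
Deno.cron("daily-hn-summary", "0 13 * * *", async () => {
  const code = await Deno.readTextFile(artifactPath);
  const res = await runCode(code, true, ["requests", "markdownify"]);
  console.log(res.stdout);
});
```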

If you made it this far, thanks for reading. Don't let big AI companies sell you on agents like they're some black magic you could never produce yourself with cheaper models, more confined infrastructure, less time, etc. There is a massive knowledge gap between "we are AI-native" and "I get agents, but if I had to code one, I would have very little idea of how to get effective output from them."

Cheers.