
2 — Of Local Models and Ollama

Hey, I’m Robert. I’m a developer, a recovering editorial director, and a guy figuring out how to build a chief of staff to run all sorts of aspects of his life, often (but not always) using AI. If you’re new here, welcome!

What’s inside

OpenClaw not necessarily required · Running local models with Ollama · Models and parameters · The world’s simplest and least useful local prompt router

There’s been much talk about OpenClaw, née ClawedBot, etc. My experience with OpenClaw so far is that it’s arguably more trouble than it’s worth these days. That’s not because it doesn’t do very interesting things once you’ve got it up and running, but right off the bat, before you can do any of those interesting things, you’ll find yourself having to set up a second phone number (a real one, for which you are paying a monthly fee). Or rather, you don’t have to do this, but this is what enables you to spin up WhatsApp on your mobile phone and chat with your little monster. And by many accounts, this is one of two magic things about OpenClaw.

The other magic thing is that it will automatically run jobs for you, which is at least theoretically great. And what does that enable you to do? The job that people keep nattering on about is a daily morning update. I’m not saying it’s not handy to have such a thing (though, hmm, maybe I am arguing that: don’t read it before you’ve done some useful work without getting distracted), but for this we bought a mini computer?

Meanwhile, the fancier and more powerful the recurring jobs OpenClaw performs for you become, the more tokens you burn, by definition. The odds are very high, too, that these tokens are being burned on one of the frontier cloud services.

The thing is, AI users are on the verge of reconsidering local AI models (whether they know it yet or not). Lots of people are experimenting on the bleeding edges of local, but I think we’ll see businesses and heavy users shift as well, for the simple reason that you aren’t being pistol-whipped for token money when you’re running locally.

Lobster also optional

Now then: you don’t need OpenClaw to run a local model. You need a tool that loads and manages the parameter set that constitutes an LLM. That is, it runs models for you.

So my approach has been to load up a mini PC (beside me here at my desk, contributing to the general clutter, alas), first with the Ubuntu Linux operating system (the latest LTS version), then with Ollama, then with an initial model: a 7-billion-parameter version of Qwen3. (See the Lab Notes at PeakZebra.com for more on this setup.)

Ollama, for the uninitiated, is a harness for running a number of different models. It downloads models from the interwebs and then gives you a command-line chat interface to interact with them. In other words, it doesn’t do any AI itself; it just provides a way to manage and use different models.

This raises the question of what a model actually is, and I’m largely not going to spell that out here because it’s a whole engineering degree’s worth of conversation, but there are a few key elements that are worth teasing out for our purposes.

Things now get a little heavy

A modern AI “model” is, at its core, a giant mathematical function. If you’re not mathematically minded, you may have forgotten what a function is, but this part is simple. A function takes inputs and maps each one onto a specific output. In math, the mapping is described by a rule, like y=2x. So if the input (x) is 7, the mapped output (y) is 14.

In AI-land, you give the function input — words, images, code, audio — and it transforms that input through many layers of calculations into a prediction about what should come next. In the case of a language model, the simplest framing is: given all the text so far, predict the next token (a token being roughly a word fragment).
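To make “predict what comes next” a bit more concrete, here is a toy sketch. It is emphatically not how an LLM works inside (a real model runs a neural network over sub-word tokens); it just counts which word tends to follow which in a tiny made-up corpus and then guesses the most common follower. The function names and the corpus are invented for illustration.

```javascript
// Toy next-word predictor: counts which word follows which in a tiny corpus.
// A real LLM uses a neural network over sub-word tokens, not a lookup table.
function trainBigrams(text) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const counts = new Map(); // word -> Map(nextWord -> count)
  for (let i = 0; i < words.length - 1; i++) {
    const current = words[i];
    const next = words[i + 1];
    if (!counts.has(current)) counts.set(current, new Map());
    const followers = counts.get(current);
    followers.set(next, (followers.get(next) || 0) + 1);
  }
  return counts;
}

// Return the word seen most often after `word`, or null if `word` is unknown.
function predictNext(counts, word) {
  const followers = counts.get(word.toLowerCase());
  if (!followers) return null;
  let best = null;
  let bestCount = 0;
  for (const [candidate, count] of followers) {
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}

const toyCorpus = "the cat sat on the mat and the cat slept";
const bigramCounts = trainBigrams(toyCorpus);
console.log(predictNext(bigramCounts, "the")); // "cat" (seen twice, versus "mat" once)
```

The same basic idea, scaled up by many orders of magnitude and with learned parameters in place of a count table, is what the rest of this section is about.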

The remarkable thing is not the basic objective, but the scale. Parameters are simply numbers — indications of the strength (weight) of relatedness of one word part to another inside the neural network — that determine how strongly different patterns influence each other. During training, the model starts with mostly random parameter values. It is then shown enormous quantities of text and repeatedly makes predictions. Each time it gets something wrong, an optimization process called gradient descent nudges the parameters very slightly in directions that reduce future error. Over trillions of examples, those tiny adjustments accumulate into an internal statistical map of language, concepts, reasoning patterns, styles, facts, and associations.
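If the phrase “gradient descent” is opaque, a one-parameter version may help. The sketch below is a toy of my own construction, not anything from a real training pipeline: it tries to learn the single number w in y = w * x from four examples, starting from a random guess and nudging w a little after each pass. The learning rate and step count are arbitrary values chosen so that this small example converges.

```javascript
// Toy gradient descent: learn the single parameter w in y = w * x.
// The examples were generated with w = 2, but the training loop is never told that.
const trainingExamples = [
  { x: 1, y: 2 },
  { x: 2, y: 4 },
  { x: 3, y: 6 },
  { x: 4, y: 8 },
];

function trainSingleWeight(data, steps = 200, learningRate = 0.01) {
  let w = Math.random(); // start from a random parameter value
  for (let step = 0; step < steps; step++) {
    // Gradient of the mean squared error with respect to w:
    // d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    let gradient = 0;
    for (const { x, y } of data) {
      gradient += 2 * (w * x - y) * x;
    }
    gradient /= data.length;
    w -= learningRate * gradient; // nudge w in the direction that reduces the error
  }
  return w;
}

const learnedWeight = trainSingleWeight(trainingExamples);
console.log(learnedWeight.toFixed(3)); // lands very close to 2
```

A real model does the same kind of nudging, only across billions of parameters at once instead of one.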

As an aside: it’s absolutely batshit nuts how this works. No one knows why setting up a sufficiently large collection of parameters causes the output to start to reflect normal conversation and what appears to be reasoning. It’s just the size and the associations, with all sorts of crazy resultant suggestions about what is perhaps going on inside a human mind. More about parameters below in the explainer section.

Smaller models are cheaper and faster, making them useful for routing tasks, classification, summarization, or lightweight assistants. Larger models are usually reserved for harder reasoning problems or higher-quality generation.

Well all right then

To get (finally) down to something practical, task number one for me, once I had a model running on my mini PC, was to write a bit of code that would send a request to that model and have it evaluate and classify the request, handing the classification (its “answer”) back to my code.

The world of AI is primarily a world written in the Python programming language, which is not something I’ve worked with in the past. I’m going to have to get up to speed with it, I’m sure, but in the interest of moving forward in the short run, and because all I’m doing here is making a web (HTTP) request, I decided to write the thing in JavaScript and run it using Node. If this paragraph says nothing to you, it doesn’t matter.
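For anyone playing along at home, here is a minimal sketch of what that kind of request can look like. This is not my actual router code, just an illustration under some stated assumptions: Ollama is listening on its default local port (11434), the model tag is a placeholder you would replace with whatever `ollama list` shows on your machine, and the two labels (“simple” and “complex”) are made up for the example. It needs a version of Node that has `fetch` built in (18 or later).

```javascript
// Hypothetical label set and model tag; swap in whatever you actually use.
const LABELS = ["simple", "complex"];
const OLLAMA_URL = "http://localhost:11434/api/generate"; // Ollama's default local port
const MODEL = "qwen3"; // assumption: replace with a tag shown by `ollama list`

// Build the instruction that asks the model to act as a classifier.
function buildClassifierPrompt(userRequest) {
  return [
    "Classify the request below as exactly one of: " + LABELS.join(", ") + ".",
    "Reply with the single label and nothing else.",
    "",
    "Request: " + userRequest,
  ].join("\n");
}

// Pull a known label out of the model's reply; fall back to "complex" if unsure.
function parseLabel(replyText) {
  const cleaned = String(replyText || "").toLowerCase();
  const found = LABELS.find((label) => cleaned.includes(label));
  return found || "complex";
}

// fetchImpl is injectable so the function can be tried without a live server.
async function classifyRequest(userRequest, fetchImpl = fetch) {
  const response = await fetchImpl(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      prompt: buildClassifierPrompt(userRequest),
      stream: false, // ask for one JSON object instead of a stream of chunks
    }),
  });
  if (!response.ok) {
    throw new Error("Ollama request failed with status " + response.status);
  }
  const data = await response.json();
  return parseLabel(data.response); // the generated text is in the "response" field
}
```

With Ollama running, something like `classifyRequest("What's the capital of France?").then(console.log)` should print one of the two labels. The label matching here is deliberately crude; a model that writes out its reasoning before answering would need that text stripped first.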

I used AI to write the code, not surprisingly, though I went over it pretty closely. My programming tool of choice these days is Cursor, though my general impression is that all the options are pretty darned good and it doesn’t much matter what you pick. Before starting in with Cursor, I used Claude to create a prompt file for this programming task, which brings me to a bit of advice.

If you want something a little complicated to be done right, it makes a lot of sense to ask AI to draft your prompt for you. Reviewing it will help you clarify what you want, and the instructions in the prompt are apt to be a whole lot more concise and accurate, with less room for misinterpretation. If you want a look at the prompt, it’s in the Field Notes at peakzebra.com.

A Quick Note about the Lay of the Land

I’m doing my AI exploration and building in public, so maybe a quick word about how the “in public” part is actually structured. What I think of as the key piece that keeps other things meaningfully oriented to each other is this newsletter.

In the newsletter, I’m expressly trying to keep the focus at a high enough level that I don’t get bogged down in installation instructions and configuration tweaks. The idea is that a broad range of people can get something useful from the newsletter, even if they aren’t especially technical.

Much of the day-to-day “this worked and this blew up” sort of note-making is appearing on Substack in the form of notes. In theory I’m also generating tweets on X, but I haven’t really figured out how to make that part of my process without wasting huge amounts of time inadvertently doomscrolling, so tweets are a work in progress.

Detailed LLM prompts, specific configuration details, JavaScript and Python code, and whatever else only people who are actually playing along at home are likely to be interested in live in the form of “field notes” on my web site, PeakZebra.com. I’m just getting really rolling on that, so expect a number of things to appear there in the next couple of weeks.

I’ll be posting other things, plus an archive of the newsletter, at PeakZebra.com as well.

Questions, comments, and suggestions always welcome. And I’m grateful if you forward this to someone you think might find it useful.


The Explainer Section

I would very much like for this project to be useful to people who haven’t really dived into some or all of the technologies I’ll be working with and writing about, so I’m thinking I’ll have a section (this section) to provide background on things that have appeared in the current newsletter issue.

Covered last week: Markdown, local models, OpenClaw

Weighted parameters

When people talk about “7B,” “14B,” or “70B” models, the “B” stands for billions of parameters. A 7B model has roughly seven billion adjustable weights; a 70B model has roughly seventy billion.

In general, larger models can capture more complex relationships and tend to reason better, follow instructions more reliably, and retain more nuanced knowledge — but they also require dramatically more memory and computing power. A 7B model may run comfortably on a consumer machine and respond quickly, while a 70B model often requires serious GPU hardware or distributed infrastructure.
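The jump in hardware requirements is mostly simple arithmetic. Here is a back-of-the-envelope sketch, assuming each parameter is stored as a 16-bit number (two bytes) unless you say otherwise. It counts only the weights themselves, so real memory use will be higher. The function name is my own invention.

```javascript
// Rough memory needed just to hold a model's weights.
// Ignores everything else (context cache, runtime overhead), so real usage is higher.
function estimateWeightMemoryGB(paramsInBillions, bitsPerParam = 16) {
  const bytesPerParam = bitsPerParam / 8;
  return paramsInBillions * bytesPerParam; // billions of bytes, i.e. decimal GB
}

console.log(estimateWeightMemoryGB(7));  // 14
console.log(estimateWeightMemoryGB(70)); // 140
```

Fourteen gigabytes is within reach of a well-equipped desktop; 140 is not, which is why the big models stay in the datacenter.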

Frontier models from companies like OpenAI, Anthropic, Google, and Meta are believed to operate at scales far beyond the openly released 70B-class models, with estimates ranging into the hundreds of billions or even trillions of parameters in various mixtures-of-experts architectures. At that size, training runs can require tens of thousands of GPUs, enormous datacenters, and budgets measured in hundreds of millions of dollars.

Quantized models

The versions people actually download and run locally are often compressed or “quantized.” Quantization reduces the precision of the numbers used for the parameters, for example by storing them in 4-bit or 8-bit form instead of higher-precision floating point values. This dramatically shrinks memory requirements and makes it possible to run surprisingly capable models on consumer hardware, though usually with some tradeoff in quality or reasoning depth. A model may therefore exist in a massive, expensive, full-precision form inside a datacenter, while the downloadable version is a stripped-down approximation sized to fit on a desktop GPU or even a laptop.
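To see what “reducing precision” means in practice, here is a toy sketch that squeezes a handful of made-up weights into 4-bit integers (whole numbers from -8 to 7) plus one shared scale factor, then expands them again. It is a deliberately simplified scheme of my own; the formats used for real model files are more elaborate. For a sense of the savings, seven billion 4-bit weights come to roughly 3.5 GB, versus roughly 14 GB at 16 bits.

```javascript
// Toy symmetric 4-bit quantization of a handful of weights.
// Simplified for illustration; real model formats are more elaborate.
function quantize4bit(weights) {
  const maxAbs = Math.max(...weights.map(Math.abs));
  const scale = maxAbs === 0 ? 1 : maxAbs / 7; // map the largest weight to +/-7
  const codes = weights.map((w) => {
    const q = Math.round(w / scale);
    return Math.max(-8, Math.min(7, q)); // clamp to the signed 4-bit range
  });
  return { codes, scale };
}

// Turn the small integers back into approximate weights.
function dequantize(codes, scale) {
  return codes.map((q) => q * scale);
}

const originalWeights = [0.82, -0.31, 0.05, -0.77, 0.4];
const { codes, scale } = quantize4bit(originalWeights);
const restoredWeights = dequantize(codes, scale);

console.log(codes);           // small integers between -8 and 7
console.log(restoredWeights); // close to the originals, but not identical
```

The restored values are each off by a little. That rounding error, multiplied across billions of weights, is where the quality tradeoff comes from.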

Accepting the mystery

To some extent, the way this works runs counter to typical intuition. The gene for blue eyes may be readily identifiable, but large model parameters do not correspond neatly to explicit rules or stored sentences. There is no single “Shakespeare parameter” or “tax law parameter.” Knowledge and behavior are distributed across the network as overlapping patterns of numerical relationships. The training process effectively compresses patterns from vast amounts of human-produced data into these numerical weights. The finished model is therefore less like a database and more like a highly compressed probabilistic simulator of patterns found in its training data.