Testing Local LLMs in Practice: Code Generation, Quality vs. Speed¶
Over the past few months, the landscape of open-weight large language models has changed dramatically. New models are being released at a pace that makes systematic evaluation difficult, yet increasingly necessary.
Most LLM comparisons still rely on synthetic benchmarks. While useful, they often fail to answer a more practical question:
How do these models perform in a real, production-like task?
This article presents a structured evaluation of local LLMs based on a concrete, repeatable, and measurable workload: autonomous code generation. Specifically, we focus on how well an autonomous agent, backed by these models, generates production-ready log parsers, and where each model lands on the quality-vs-speed trade-off.
Log Parsing: The Backbone of SIEM
A SIEM (Security Information and Event Management) system collects and analyzes logs from across an organization’s infrastructure to detect cybersecurity threats.
But before any detection or analysis can happen, logs must be transformed from raw text into structured data. This process - log parsing - is foundational:
- If parsing is wrong, detection is wrong
- If parsing is incomplete, visibility is lost
In practice, parsers are written and maintained by humans. That does not scale:
- Log formats constantly change
- New sources keep appearing
- Quality degrades over time
Log parsing becomes a bottleneck and a silent failure point. This makes it a strong candidate for automation using LLMs, where the task is not just to parse logs, but to generate working parsers automatically.
From “Vibe Coding” to Autonomous Agents¶
There is an important distinction to make upfront. This is not about interactive coding, copilots, or “vibe coding.” Instead, this is a fully autonomous agentic task:
- The agent receives a well-defined objective
- The agent generates complete Go code (a log parser), powered by the model under test
- The generated code must compile and run
- The output is validated against real input data
No human is guiding the process mid-way. No manual corrections. This setup tests something much closer to real automation potential than typical prompt-based evaluations.
Quality vs. Speed: The Core Trade-off¶
Every model we tested falls somewhere on a curve between two extremes:
- Quality-first: with some models, the agent produces more accurate parsers but generates them slowly
- Speed-first: with other models, the agent produces many parsers quickly, often at “good enough” quality
The interesting question is not “which model is best?” but “which model is best for your requirements?”
To make this trade-off tangible, we built an interactive visualization:
How to read it:
- X axis: Speed, in parsers generated per hour. The wall-clock time of failed attempts is charged to the successful ones, so a model that fails often is correctly penalized. Faster models are on the right.
- Y axis: Quality, on a 0..100 scale that means "% of the best score ever observed for that log category". Each log category has its own scoring scale (fortigate runs can score up to ~75 points, ibm_fs only up to ~24), so raw points are not comparable across categories. We divide every run's score by that category's maximum and take the mean. A quality of 100 therefore means "as good as anyone has ever been on this category", not "all fields perfect".
- Each point: one model evaluated on one testing infrastructure.
A model that has been benchmarked on more than one testing infrastructure appears as multiple points, one per infrastructure. This makes the effect of the deployment context visible at a glance: the same model shifts left or right on the speed axis depending on where it runs, while its quality stays roughly constant.
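As a concrete illustration of that normalization, here is a minimal Go sketch. The category maxima (~75 for fortigate, ~24 for ibm_fs) come from the description above, but the individual run scores are made up:

```go
package main

import "fmt"

// run holds one evaluation run: the raw score the generated parser achieved
// and the best score ever observed for that log category.
type run struct {
	category    string
	score       float64 // raw points for this run (made-up values below)
	categoryMax float64 // best score ever observed in this category
}

// normalizedQuality converts each run to "% of the category's best score"
// and averages across runs, giving the 0..100 value on the quality axis.
func normalizedQuality(runs []run) float64 {
	if len(runs) == 0 {
		return 0
	}
	var sum float64
	for _, r := range runs {
		sum += 100 * r.score / r.categoryMax
	}
	return sum / float64(len(runs))
}

func main() {
	runs := []run{
		{category: "fortigate", score: 60, categoryMax: 75}, // 80% of the best observed
		{category: "ibm_fs", score: 18, categoryMax: 24},    // 75% of the best observed
	}
	fmt.Printf("quality = %.1f\n", normalizedQuality(runs)) // quality = 77.5
}
```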
Qwen3.6-27B: BF16 vs. FP8
We tested Qwen3.6-27B in both of its original quantization variants — BF16 and FP8 — across the testing infrastructures available to us.
Two observations stand out:
- As expected, quality stays roughly constant across infrastructures; only the throughput moves.
- More interestingly, we see no measurable quality gap between BF16 and FP8. The two variants land on top of each other on the quality axis, with FP8 sitting clearly to the right (faster). For this task, the lower-precision weights cost us nothing we can measure.
Together, these results make Qwen3.6-27B-FP8 a particularly strong pick for autonomous log-parser generation: speed that approaches that of much smaller models, with quality that holds up against models several times its size.
The dataset behind the chart grows as we continuously evaluate more models. The rest of this post walks through the methodology that powers it.
Why This Matters¶
This kind of evaluation is not just academic.
If you want to move from experimentation to production, you need to answer:
- Which model is good enough?
- Which model is fast enough?
- Which model is deployable on my hardware?
In our case, this directly informs:
- Practical approach to on-prem deployments
- GPU sizing decisions
- Model selection for autonomous workflows
How We Tested¶
The rest of this article walks through the methodology behind the visualization above: the task, the metrics, the dataset, the infrastructure, and the models evaluated.
The Task: Generate a Working Log Parser¶
The core task is simple to describe, but non-trivial to execute:
The task
Given a sample of unstructured logs, the agent generates a Go parser that transforms them into structured JSON with a predefined schema.
This introduces multiple layers of difficulty:
- Understanding semi-structured or messy raw logs on the input
- Designing parsing logic
- Producing syntactically correct and idiomatic Go code
- Ensuring the output matches a target schema (field names + types)
Once the agent has generated a parser, the test harness then (1) compiles it, (2) executes it on input data, and (3) evaluates the result.
Diagram: Agent workflow
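To make those three steps concrete, here is a stripped-down Go sketch of what such a harness step could look like. The paths, the output binary name, and the overall structure are illustrative assumptions, not our actual harness:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// buildAndRun (1) compiles the generated parser, (2) executes it on the raw
// log sample, and returns the parser's stdout so the harness can
// (3) evaluate it against the target schema. All paths are placeholders.
func buildAndRun(parserDir, inputFile string) ([]byte, error) {
	// (1) Compile the generated Go code into a local binary.
	build := exec.Command("go", "build", "-o", "parser", ".")
	build.Dir = parserDir
	if out, err := build.CombinedOutput(); err != nil {
		return nil, fmt.Errorf("compilation failed: %v\n%s", err, out)
	}

	// (2) Run the parser with the raw logs on stdin.
	in, err := os.Open(inputFile)
	if err != nil {
		return nil, err
	}
	defer in.Close()

	run := exec.Command(filepath.Join(parserDir, "parser"))
	run.Stdin = in
	return run.Output() // stdout: structured JSON, ready for (3) evaluation
}

func main() {
	out, err := buildAndRun("./generated-parser", "./sample.log")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s", out)
}
```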
Metrics: What Does “Good” Look Like?¶
We evaluate models along two primary dimensions:
Quality¶
How much useful information does the generated parser extract?
- Number of correctly extracted fields
- Schema compliance (names, types)
- Semantic correctness
This reflects the agent’s understanding and precision.
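As a simplified sketch of what field-level scoring can look like (the expected-field map below is an illustrative example, not our actual scoring table, and the real scoring also weighs types and semantic correctness):

```go
package main

import "fmt"

// scoreFields counts how many expected fields the generated parser extracted
// with the value the reference output specifies. This only illustrates the
// idea; the real scoring also checks types and semantic correctness.
func scoreFields(parsed, expected map[string]any) (correct, total int) {
	for field, want := range expected {
		total++
		if got, ok := parsed[field]; ok && got == want {
			correct++
		}
	}
	return correct, total
}

func main() {
	expected := map[string]any{
		"process.name":      "dnsmasq",
		"dns.question.name": "www.example.com",
		"dns.resolved_ip":   "192.0.2.25",
	}
	parsed := map[string]any{
		"process.name":      "dnsmasq",
		"dns.question.name": "www.example.com",
		// dns.resolved_ip missing: the parser failed to extract it
	}
	c, t := scoreFields(parsed, expected)
	fmt.Printf("%d/%d expected fields correct\n", c, t) // 2/3 expected fields correct
}
```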
Speed¶
How fast can the agent produce working Go log parsers?
- Number of parsers generated per unit of time (hour)
- Stability across repeated runs
This reflects practical usability at scale.
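A minimal sketch of how the speed axis is derived under the charging rule described earlier (failed attempts consume wall-clock time and therefore lower the rate); the numbers are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// parsersPerHour computes the speed axis: successful parsers per hour of
// total wall-clock time. Because the denominator includes time spent on
// failed attempts, those failures are charged to the successful runs.
func parsersPerHour(successes int, totalWallClock time.Duration) float64 {
	if totalWallClock <= 0 {
		return 0
	}
	return float64(successes) / totalWallClock.Hours()
}

func main() {
	// Illustrative: 18 working parsers over an 8-hour window,
	// with failed attempts included in the elapsed time.
	fmt.Printf("%.2f parsers/hour\n", parsersPerHour(18, 8*time.Hour)) // 2.25 parsers/hour
}
```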
Test Design: Reducing Noise, Measuring Reality¶
Each model is tested under the same conditions:
- 8-hour test window per model
- Multiple iterations to reduce statistical variance
- Parallel execution (e.g., 6 parsers generated simultaneously)
Instead of a single-shot evaluation, the test captures:
- Stability over time
- Variance in output quality
- Ability to sustain throughput under load
Parallelism is intentional; it exposes how well a model performs when pushed toward maximum token throughput, not just single-request latency.
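For illustration, a minimal Go sketch of the parallel setup. Only the concurrency limit of 6 mirrors the example above; generateParser is a hypothetical stand-in for one full agent run, and the category list is illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// generateParser is a hypothetical stand-in for one agent run: hand the
// model a log category, get back a parser attempt (or an error).
func generateParser(category string) (string, error) {
	return "parser for " + category, nil
}

func main() {
	categories := []string{"dnsmasq", "fortigate", "windows", "ibm_fs"}

	const parallelism = 6 // e.g., 6 parsers generated simultaneously
	sem := make(chan struct{}, parallelism)
	var wg sync.WaitGroup

	for _, c := range categories {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(category string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if _, err := generateParser(category); err != nil {
				fmt.Println(category, "failed:", err)
				return
			}
			fmt.Println(category, "done")
		}(c)
	}
	wg.Wait()
}
```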
Dataset: Real Logs, Not Synthetic Tasks¶
The evaluation uses real-world log data across multiple domains, including:
- Windows logs
- DNS logs
- Typical network devices
- Enterprise systems
We selected eight distinct log categories, each with:
- Different structure (or lack thereof)
- Different levels of noise
- Different parsing difficulty
Example log
Input raw log:
Mar 10 08:04:18 dnsmasq[30249]: cached www.example.com is 192.0.2.25
Parsed log:
{
"@timestamp": "2026-03-10T08:04:18Z",
"dns.question.name": "www.example.com",
"dns.resolved_ip": "192.0.2.25",
"dns.response_code": "NOERROR",
"event.action": "dns_cached",
"message": "cached www.example.com is 192.0.2.25",
"process.name": "dnsmasq",
"process.pid": 30249
}
This ensures the task is:
- Representative
- Non-trivial
- Hard to overfit
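For a sense of what the agent has to produce, here is a hand-written sketch of a parser for the dnsmasq line above. A real generated parser covers many more message variants; note that fields like dns.response_code and the timestamp's year come from the target schema and reference output, not from the raw line itself:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"time"
)

// Matches lines like:
//   Mar 10 08:04:18 dnsmasq[30249]: cached www.example.com is 192.0.2.25
var cachedRe = regexp.MustCompile(
	`^(\w{3} +\d+ [\d:]+) (\w+)\[(\d+)\]: (cached (\S+) is (\S+))$`)

// parseLine turns one raw dnsmasq "cached" line into the structured form
// shown above. Syslog timestamps carry no year, so one is supplied.
func parseLine(line string, year int) (map[string]any, error) {
	m := cachedRe.FindStringSubmatch(line)
	if m == nil {
		return nil, fmt.Errorf("unrecognized line: %q", line)
	}
	ts, err := time.Parse("2006 Jan 2 15:04:05", fmt.Sprintf("%d %s", year, m[1]))
	if err != nil {
		return nil, err
	}
	pid, _ := strconv.Atoi(m[3])
	return map[string]any{
		"@timestamp":        ts.Format(time.RFC3339),
		"dns.question.name": m[5],
		"dns.resolved_ip":   m[6],
		"dns.response_code": "NOERROR", // the answer was served, so no error
		"event.action":      "dns_cached",
		"message":           m[4],
		"process.name":      m[2],
		"process.pid":       pid,
	}, nil
}

func main() {
	out, _ := parseLine("Mar 10 08:04:18 dnsmasq[30249]: cached www.example.com is 192.0.2.25", 2026)
	b, _ := json.MarshalIndent(out, "", "  ")
	fmt.Println(string(b))
}
```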
Infrastructure: Testing at Scale¶
All models are executed on quatro-dgx-spark, a 4-node NVIDIA DGX Spark cluster interconnected via high-speed 100G networking (ConnectX-7).
Diagram: Primary testing infrastructure
Key properties:
- Nearly 512 GB of unified memory
- High interconnect bandwidth
- Support for running large models and parallel workloads
This setup provides:
- Consistent conditions across models
- Enough capacity to test both large models and parallel scaling behavior
An important side effect
All models, including smaller ones, are spread across nodes, effectively increasing their throughput. In other words, even if the model fits into one DGX Spark with 128GB of unified memory, it is run on the whole cluster, which typically means that the model is nearly 4 times faster.
Secondary testing infrastructures¶
Not every model is a clean fit for the primary cluster.
Some are too large for the unified memory of a single DGX Spark and need a different host; others are small enough that running them on the full cluster would mask their real characteristics.
To keep the comparison representative, we run the same evaluation on additional hardware and tag every row in the dataset with a comment field identifying which infrastructure produced it.
The choice of hardware has a direct impact on Speed, but should not affect Quality. The agent generates a parser that either compiles, runs, and extracts the right fields, or it doesn't, regardless of how fast it was produced. Comparing the same model across infrastructures is a good way to verify that this assumption holds.
The infrastructures currently represented in the dataset:
- quatro-dgx-spark: the primary 4-node DGX Spark cluster described above. The reference setup; most models are tested here.
- solo-dgx-spark: a single NVIDIA DGX Spark (~128 GB of unified memory). Used for smaller models and to measure the speed-up the full cluster delivers over one node.
- solo-rtx6000-bw: a single NVIDIA RTX 6000 Blackwell GPU with 96 GB of VRAM. The card is faster than a DGX Spark but has less memory.
- octo-rtx6000-bw: eight NVIDIA RTX 6000 Blackwell GPUs in a single server. Used for very large models that don't fit elsewhere, and as a sanity check against a fundamentally different memory and interconnect topology.
In the visualization at the top of the article, the infrastructure for each data point is shown next to the model name, and the filter chips let you focus on a single setup or compare them side by side.
Model Selection: Following the Open-Weight Frontier¶
Models were selected based on:
- Recent releases
- Community traction (Hugging Face, Reddit, LinkedIn)
- Practical relevance
The set includes models from multiple ecosystems:
- US-based models such as those from OpenAI, Google and IBM (Granite family)
- Chinese models (e.g., Qwen, DeepSeek, MiniMax)
- European models from Mistral
The goal was not exhaustive coverage, but representative diversity of the current frontier.
Do not see your favorite model?
Let us know if your favorite model is missing. The test is reproducible, and we routinely evaluate and add new models.
We ❤️ Open source
We will try to publish a model configuration on our GitHub soon.
Final Thoughts¶
The current wave (H1 2026) of open-weight models is not just about scale; it’s about practical usability.
What this evaluation shows so far:
- Agentic code generation on local LLMs is already viable. A meaningful fraction of the parsers the agent produces compile, run, and extract real fields on the first attempt, without any human in the loop.
- Model differences are large and measurable. For example, on the same hardware, the dense Qwen3.6-27B clearly outperforms its 35B-A3B MoE sibling.
And perhaps most importantly: the “best” model is not a fixed answer. It depends on how you balance quality, speed, available hardware - and how much variance you can tolerate in the output.
Key takeaways
- The harness and the prompt matter enormously. Quality can be substantially improved without changing the model, by tightening the prompt, providing better examples, and improving how the harness retries and recovers from errors.
- Give the agent enough time. When a task is cut off prematurely, the failure rate climbs sharply. A wall-clock timeout hides model capability rather than measuring it.
- Treat the absolute numbers as indicative, not exact. The big-picture ordering in this evaluation is already clear and actionable, but the standard deviations on the chart are a useful reminder: small gaps in mean quality between two nearby points are within run-to-run variance. More runs will sharpen the picture over time.