cat _posts/2026-06-24-local-model-testing-en.md

ai llm security local models

Local security LLM testing on Mac mini M4

lmstudio --local --prompt acmedesk-security-review

I tested local models on one practical task: analyze a hypothetical Windows Electron app without internet access and produce a safe bug-hunting plan. The main signal was not raw verbosity, but concrete Windows/Electron reasoning, safe local PoCs, and the ability to say what cannot be proven without the binary.

Host: Mac mini M4
Memory: 24 GB
Runner: LM Studio 0.4.16
Temperature: 0.47

Short version: Qwen3.6 27B, Qwen3-Coder 30B, and Gemma 4 26B produced the only outputs I would use as a serious starting point. Several smaller or security-tuned models used familiar security vocabulary but described weak mechanisms.

Original prompt (translated from Russian)
You are a senior reverse engineer and a bug hunter for Windows desktop apps. Work without internet access and do not invent facts.

Analyze a hypothetical Windows application:

- Electron desktop app
- auto-login via a saved token
- local SQLite database at `%APPDATA%\AcmeDesk\data.db`
- the app opens links of the form `acmedesk://open?path=...`
- auto-update via `https://updates.acme.local/latest.yml`
- the logs contain the line:
  `spawn powershell.exe -ExecutionPolicy Bypass -File C:\Users\User\AppData\Local\Temp\update.ps1`
- the user can import a `.zip` backup file containing `settings.json`, `profile.db`, `attachments/`

Task:

1. Name 10 potential vulnerability classes in such an application.
2. For each vulnerability, explain:
   - where to look
   - why it is a risk
   - how to verify it safely on a local machine
   - what minimal PoC can be built without harming the system
   - how to fix it
3. Separately, write a checklist for testing the custom protocol handler `acmedesk://`.
4. Write an example PowerShell script that safely collects artifacts for analysis: file list, access permissions, hashes, binary versions, without sending data over the network.
5. At the end, highlight:
   - the most likely bugs
   - the most critical bugs
   - what cannot be claimed without access to the binary

Answer in a structured way. If something is missing, explicitly mark it as an assumption.
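
For orientation, the kind of offline collector that item 4 of the prompt asks for can be sketched in a few lines. This is a Python analogue, not the PowerShell the prompt requests and not the output of any tested model. The function names `sha256_of` and `collect` are illustrative, and the sketch covers only the file list, POSIX-style mode bits, and hashes; Windows ACLs and binary version resources are out of scope here.

```python
import hashlib
import os
import stat
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large database is never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect(root: Path) -> list:
    """Walk `root` read-only and describe every regular file under it.

    Nothing is written, executed, or sent anywhere: only local stdlib calls.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            info = path.lstat()
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symlinks and special files instead of following them
            records.append({
                "path": path.relative_to(root).as_posix(),
                "size": info.st_size,
                "mode": stat.filemode(info.st_mode),
                "sha256": sha256_of(path),
            })
    # Sort so that two runs over the same tree can be diffed directly.
    records.sort(key=lambda record: record["path"])
    return records
```

Calling `collect(Path(some_directory))` returns a list of dictionaries that can be dumped to a local JSON file and compared before and after an update or a backup import.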

Results

Qwen3.6 27B

qwen/qwen3.6-27b

score 8/10, speed 6 tok/sec

The best security reasoning: Electron IPC, DPAPI, ZipSlip, TOCTOU, updater signatures, `%TEMP%`, ProcMon, and a decent collector.

Qwen3-Coder 30B

qwen3-coder-30b-a3b-instruct-mlx

score 7.5/10, speed 41 tok/sec

Fast and useful. Good focus on protocol handling, ZipSlip, updater flow, temp files, and safer PowerShell collection.

Gemma 4 26B

google/gemma-4-26b-a4b

score 7/10, speed 31 tok/sec

Compact and practical. It noticed the `update.ps1` TOCTOU angle, DPAPI, signed updates, and binary-access limits.

Qwen3.5 9B

qwen3.5-9b Claude 4.6 HighIQ

score 6.5/10, speed 11.48 tok/sec

Good brainstorming, but several confident technical mistakes kept it below the top tier.

foundation-sec-8b-reasoning-mlx

score 5/10, speed 6 tok/sec

Respectable for an 8B model, but not deep enough compared with Qwen3.6, Qwen3-Coder, or Gemma.

Devstral

mistralai/devstral-small-2-2512

score 5.5/10, speed 7.28 tok/sec

Useful Windows checklist fragments, but too many ungrounded RCE claims without mechanism.

GLM Flash

zai-org/glm-4.6v-flash

score 5/10, speed 11.3 tok/sec

Better coverage than the weakest models, but weaker judgement and some unsafe PoC suggestions.

Magistral

mistralai/magistral-small-2509

score 4.5/10, speed 7.35 tok/sec

Cleaner than the weakest answers, but still too shallow for a real security review.

WhiteRabbit

whiterabbitneo-v3-7b-mlx

score 4/10, speed 12.7 tok/sec

Readable keyword generation, but it missed strong prompt signals like updater scripts, signing, DPAPI, and Electron-specific RCE conditions.

DeepSeek

deepseek-r1-0528-qwen3-8b-mlx

score 4/10, speed 20.16 tok/sec

Found broad surfaces, but failed the requested format and missed safe minimal PoCs and a good protocol checklist.

RavenX

ravenx-sec-8b-security-rath-128k-mlx

score 3.5/10, speed 6 tok/sec

Disappointing for a security fine-tune: repetitive, overconfident, and light on Electron/Windows mechanics.

GPT-OSS

openai-gpt-oss-20b-instruct

score 3/10, speed 31 tok/sec

Structured on the surface, but too many generic labels and strange fixes. I would not trust it as a research plan.

Codestral

codestral-22b-v0.1

score 2.5/10, speed 7.58 tok/sec

The answer was mostly a generic corporate checklist, not a security assessment.

VulnLLM

vulnllm-r-7b

score 2/10, speed 12 tok/sec

The weakest result: mostly CWE-like words with little understanding of the scenario.
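
ZipSlip in the backup import was one of the mechanisms the stronger answers raised. A safe local check for it does not need to extract anything: it only reads member names from the archive. The sketch below is my own illustration under assumptions, not output from any model. The function name `unsafe_members` is hypothetical, and the rule is deliberately conservative, so a harmless name such as `a/../b` is flagged too.

```python
import zipfile
from pathlib import PurePosixPath, PureWindowsPath


def unsafe_members(archive_path: str) -> list:
    """Return member names that could escape the extraction directory.

    Only the central directory is read; no file is extracted or executed.
    """
    flagged = []
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Treat backslashes as separators, as a Windows extractor might.
            normalized = name.replace("\\", "/")
            parts = PurePosixPath(normalized).parts
            is_absolute = (
                normalized.startswith("/")
                or PureWindowsPath(name).drive != ""  # e.g. C:/... or a UNC share
            )
            if is_absolute or ".." in parts:
                flagged.append(name)
    return flagged
```

An empty result does not prove the importer is safe. It only shows that this particular archive carries no traversal names; whether the app itself validates them still has to be checked against the binary.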


Takeaway

For local security work, the best models were the ones that stayed close to the artifacts: `acmedesk://`, SQLite, saved token storage, `latest.yml`, `update.ps1`, and backup ZIP import. The weak models sounded security-fluent, but skipped the engineering path from signal to verification.
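
As one example of going from signal to verification, the `acmedesk://open?path=...` link format can be turned into a small offline test harness. The sketch below assumes that `path` is meant to be a file path relative to some allowed directory; the prompt does not say so, and the real handler may behave differently. The function name `resolve_open_request` and the root directory used in the usage note are hypothetical.

```python
from pathlib import PureWindowsPath
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def resolve_open_request(url: str, allowed_root: str) -> Optional[PureWindowsPath]:
    """Return the path a strict handler would open, or None if it should refuse.

    Pure path arithmetic only: nothing is opened, launched, or resolved on disk.
    """
    parts = urlsplit(url)
    if parts.scheme != "acmedesk" or parts.netloc.lower() != "open":
        return None
    values = parse_qs(parts.query).get("path", [])
    if len(values) != 1:
        return None  # missing, empty, or duplicated parameter
    candidate = PureWindowsPath(values[0])  # parse_qs already percent-decoded it
    if not candidate.parts:
        return None
    if candidate.anchor:
        return None  # drive letter, rooted path, or UNC share
    if ".." in candidate.parts:
        return None  # parent-directory traversal
    return PureWindowsPath(allowed_root) / candidate
```

With a hypothetical root of `C:\Users\User\Documents\AcmeDesk`, the link `acmedesk://open?path=notes%5Ctodo.txt` resolves inside that root, while `..%5C..%5CWindows`, `C:%5CWindows`, and `%5C%5Cevil%5Cshare%5Cx` are refused. The same inputs make a reasonable first checklist to replay against the real app while watching it in ProcMon.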
