MENIVA / ENG / LOGBOOK · VOL. 01 - 04 ENTRIES · RELEASE DOSSIER

What we actually shipped.

Four systems, in production. This is the engineering logbook: the same dossier we hand a technical buyer. Business reader: start at the headline. Engineer: inspect the pipeline, expand the topics.

4 Systems · 74k+ Indexed · 6x4 Eval metrics · 100% Grounded
ENTRY NO. 01 · Hospitality / F&B Analytics · Live delivery · RAG / Semantic Search · LIVE

TasteTrend Analytics

Identity · Retrieval intelligence

RAG-powered intelligence layer over 74k+ restaurant reviews across 200+ locations

Problem

74k+ reviews accumulated across Google, TripAdvisor, and internal channels — no queryable layer.

System Flow
6 stages · Review Ingestion -> Grounded Answer · RAG / Semantic Search
Signature Module · Retrieval intelligence

Inspect retrieval

RAG / Semantic Search

Watch one question move through the retrieval stack: query, ranked chunks, citations, and grounded answer.

Retrieval Trace · System Output ● live
QUERY  "top service complaints Q4 downtown"
  ↓ embed · text-embedding-3-small · 1,536-dim · ~80ms
ANN    FAISS IVF100 · top-20 candidates · 40ms
  ↓ BGE cross-encoder rerank · 180ms
TOP 5  ├─ Rev #47821 "Slow service at..." 0.94
       ├─ Rev #12903 "Staff was unresponsive..." 0.87
       └─ Rev #29841 "Waited 40 min for..." 0.81
  ↓ GPT-4o · citation enforcement · ~1.4s
OUT    Answer grounded [¹][²][³] · 87% faithful

Business Impact

Outcomes
Before · Analysis time per cycle: 3–5 weeks
After · Analysis time per cycle: < 4 hours

The review corpus became a queryable operational asset. Analysis time dropped from multi-week sprints to sub-four-hour sessions. Every generated insight is traceable to a specific cited review. The team now runs ad-hoc queries — "What are the top service complaints in Q4 across downtown locations?" — and receives grounded answers in seconds.

52× · Faster review analysis
200+ · Restaurant locations indexed
74k+ · Reviews in active corpus

Engineering Evaluation

Nightly eval
81% · Recall@5
87% · Grounding Rate
95% · Dedup Coverage
84% · Query Answer Relevancy
Headline business result: 52× faster review analysis

Why This Is Hard

4 engineering challenges
Challenge · 01

Hallucination on multi-location queries

When a query spanned ambiguous location signals, the LLM blended review content across sites. Solved by injecting explicit location metadata as context anchors in the system prompt and requiring per-citation location attribution in the output schema.
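A minimal sketch of how per-citation location attribution could be checked after generation. The `Citation` shape, the `validate_citations` helper, and the location names are hypothetical, not the production schema; the idea is only that an answer is rejected when a citation is missing, mis-attributed, or outside the query's location scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    review_id: int
    location: str  # location the model claims the cited review belongs to


def validate_citations(citations, allowed_locations, known_review_locations):
    """Return a list of problems; an empty list means the answer passes."""
    problems = []
    if not citations:
        problems.append("answer has no citations")
    for c in citations:
        actual = known_review_locations.get(c.review_id)
        if actual is None:
            problems.append(f"review {c.review_id} not in corpus")
        elif actual != c.location:
            problems.append(
                f"review {c.review_id} attributed to {c.location}, stored as {actual}")
        elif c.location not in allowed_locations:
            problems.append(f"review {c.review_id} is outside the query's location scope")
    return problems
```

An answer with a non-empty problem list can be regenerated or returned with a warning; which of the two is appropriate is a product decision this sketch does not make.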

Challenge · 02

Near-duplicate review noise at scale

Chain-level copy-paste reviews (same text across multiple franchise locations) inflated the corpus without adding information. MinHash LSH at a 0.85 Jaccard threshold, applied at ingestion, removed ~12% of raw corpus volume as near-duplicates.
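For illustration, a dependency-free sketch of the MinHash half of this step. It estimates Jaccard similarity from signatures and applies the 0.85 threshold; the LSH banding that makes lookup sub-linear is omitted. The signature length (128) and 5-character shingles are assumptions, not values from the production pipeline.

```python
import hashlib

NUM_PERM = 128     # signature length (assumed)
THRESHOLD = 0.85   # Jaccard threshold stated in the text


def shingles(text, k=5):
    """Character k-shingles of lowercased, whitespace-normalised text."""
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}


def _hash(seed, shingle):
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(shingle_set, num_perm=NUM_PERM):
    """One minimum per seeded hash function stands in for a random permutation."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the true Jaccard."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def is_near_duplicate(text_a, text_b, threshold=THRESHOLD):
    sig_a, sig_b = minhash(shingles(text_a)), minhash(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= threshold
```

At corpus scale the signatures would be bucketed by band so that only colliding pairs are compared; comparing every pair as above is quadratic and only suitable for a demonstration.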

Challenge · 03

Incremental index updates without full reindex

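FAISS flat indexes do not support stable in-place deletion: removing vectors from a flat index renumbers the remaining sequential IDs. Solved via an ID map append-and-compact strategy: new embeddings are appended to a staging index, then merged into the main index during off-peak compaction, which is triggered when the delta exceeds 5% of the total.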
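The append-and-compact bookkeeping can be sketched without FAISS at all. The class below is a hypothetical stand-in that uses plain dictionaries where production would hold a main and a staging index; counting tombstoned deletions toward the delta is an assumption of this sketch.

```python
class StagedIndex:
    """Toy stand-in for a main index plus a staging index, keyed by external ID."""

    def __init__(self, compact_ratio=0.05):
        self.main = {}           # id -> embedding, rebuilt only at compaction
        self.staging = {}        # id -> embedding, cheap appends
        self.tombstones = set()  # ids deleted from main since the last compaction
        self.compact_ratio = compact_ratio

    def add(self, vec_id, embedding):
        self.staging[vec_id] = embedding
        self.tombstones.discard(vec_id)

    def delete(self, vec_id):
        self.staging.pop(vec_id, None)
        if vec_id in self.main:
            self.tombstones.add(vec_id)

    def live_ids(self):
        return (set(self.main) - self.tombstones) | set(self.staging)

    def needs_compaction(self):
        delta = len(self.staging) + len(self.tombstones)
        total = len(self.main) + len(self.staging)
        return total > 0 and delta > self.compact_ratio * total

    def compact(self):
        """Merge staging into main and drop tombstoned ids (the off-peak job)."""
        merged = {i: v for i, v in self.main.items() if i not in self.tombstones}
        merged.update(self.staging)
        self.main, self.staging, self.tombstones = merged, {}, set()
```

Queries between compactions would search both structures and filter out tombstoned IDs, which is what `live_ids` models here.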

Challenge · 04

Multilingual corpus (EN/HU/DE)

text-embedding-3-small outperformed multilingual-e5-large and mE5 on mixed-language retrieval recall in internal benchmarks on this specific corpus. Language detection is applied at ingestion to enable language-band filtering at retrieval time.

Engineering Depth


Two-stage retrieval: FAISS IVF100 flat index returns top-20 ANN candidates in ~40ms, followed by BGE-reranker-large cross-encoder scoring each candidate against the full query text. Top-5 advance to generation. The reranker adds ~180ms latency but improved Recall@5 by 11 pp over retrieve-only in internal A/B evaluation on 150 labelled queries. Metadata pre-filters on location/date run before ANN, reducing search space by up to 60% on scoped queries.
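A hedged sketch of the filter, retrieve, rerank flow described above, written in plain Python so it runs anywhere. Exhaustive cosine search stands in for the FAISS ANN stage, and `token_overlap` is a deliberately crude placeholder for the cross-encoder; the chunk dictionary layout is also an assumption.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, query_text, chunks, rerank_fn,
             location=None, n_candidates=20, top_k=5):
    # Stage 0: metadata pre-filter shrinks the search space before vector search.
    pool = [c for c in chunks if location is None or c["location"] == location]
    # Stage 1: coarse vector search (exhaustive here, ANN in production).
    pool.sort(key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    candidates = pool[:n_candidates]
    # Stage 2: score each candidate against the full query text, keep the best.
    scored = [(rerank_fn(query_text, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def token_overlap(query, passage):
    """Placeholder reranker: share of query tokens found in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0
```

The point of the two stages is cost: the cheap vector score prunes the corpus to 20 candidates so the expensive pairwise scorer runs only 20 times per query.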

Key numbers
312k · Total indexed chunks
81% · Recall@5
Built with: RAG · Semantic Search · FAISS · GPT-4o · AWS
ENTRY NO. 02 · SaaS / B2B Revenue Operations · Live delivery · Lead Scoring / Feature Engineering · LIVE

Revon

Identity · Revenue signal engine

AI revenue signal engine — lead scoring, prioritisation, and next-best-action from inbound noise

Problem

No scoring layer — inbound lead triage relied entirely on manual gut-feel review.

System Flow
6 stages · Inbound Signal Collection -> Next-Best-Action · Lead Scoring / Feature Engineering
Signature Module · Revenue signal engine

Inspect scoring

Lead Scoring / Feature Engineering

See how an inbound lead is decomposed into features, scored, ranked, and handed off with reasoning attached.

Lead Scoring Trace · Prototype Benchmark ● live
INPUT  Acme Corp · Series B · 85 employees
  ↓ Clearbit enrich + LLM extraction · 1.2s
FEATS  ICP alignment ·················· 0.82
       Intent keyword density ·········· 0.71
       Tech stack overlap ·············· 0.68
       Hiring signal (RevOps) ·········· HIGH
  ↓ XGBoost · 43 features · <5ms inference
SCORE  0.74 → HOT TIER
  ↓ SHAP explanation + outreach draft
ROUTE  AE notified via Slack · CRM updated · 1.8s total

Business Impact

Outcomes
Before · Lead triage method: Manual gut-feel review
After · Lead triage method: Calibrated ML score + LLM signals

Inbound leads are scored, ranked, and routed automatically. High-intent signals surface to reps within minutes with a structured qualification card. The modelled pipeline opportunity uplift is €850k, based on historical conversion rates applied to improved triage. The reduction in manual review overhead is estimated at 11h/week per rep, time that is redirected to active selling. Both figures are modelled estimates, not measured outcomes.

€850k · Modelled pipeline uplift
11h/wk · Manual review reduction
78% · Top-3 prioritisation accuracy

Engineering Evaluation

Nightly eval
78% · Top-3 Prioritisation Accuracy
91% · Enrichment Completeness
89% · Score Consistency
82% · SHAP Signal Accuracy
Headline business result: €850k modelled pipeline uplift

Why This Is Hard

4 engineering challenges
Challenge · 01

Feature distribution shift across lead sources

LinkedIn and form leads have systematically different signal distributions. Solved via source-specific feature scaling and a source-channel indicator feature that lets the model adapt to distribution differences without separate per-source models.
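One way to picture source-specific scaling plus a source-channel indicator, as a small sketch. The row layout, the `intent` feature name, and the use of z-scores are assumptions; the logbook does not say which scaler is used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fit_source_stats(rows, feature):
    """Per-source mean and standard deviation for one numeric feature."""
    by_source = defaultdict(list)
    for r in rows:
        by_source[r["source"]].append(r[feature])
    # Fall back to 1.0 when a source has zero variance to avoid dividing by zero.
    return {s: (mean(v), pstdev(v) or 1.0) for s, v in by_source.items()}


def transform(row, feature, stats, sources):
    """Scaled feature followed by a one-hot source-channel indicator."""
    mu, sigma = stats[row["source"]]
    scaled = (row[feature] - mu) / sigma
    indicator = [1.0 if row["source"] == s else 0.0 for s in sources]
    return [scaled] + indicator
```

After this transform a lead that is one standard deviation above its own channel's average looks the same to the model regardless of channel, while the indicator still lets the model learn channel-level effects.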

Challenge · 02

LLM semantic score calibration

Raw GPT-4o-mini scores clustered near 0.5 without explicit calibration. Resolved by prompting the model to use the full 0–1 scale with anchor examples at each decile, followed by a distribution normalisation step to enforce uniform spread across the scoring output.
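The normalisation step is not specified further; one common choice that enforces a uniform spread is a rank-based (empirical CDF) transform, sketched here as an assumption. The function name is hypothetical.

```python
def uniform_normalise(scores):
    """Map scores into (0, 1) by rank; tied scores share their average rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    out = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            out[order[k]] = avg_rank / (n + 1)
        i = j + 1
    return out
```

For example, raw scores of 0.48, 0.52, and 0.50 become 0.25, 0.75, and 0.5: the ordering is preserved while the cluster around 0.5 is stretched across the scale.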

Challenge · 03

Recency bias in training data

The model initially over-weighted recently-contacted leads due to label timing effects in the CRM data. Fixed via time-normalised feature engineering and a temporal train/test split that evaluates on a future cohort, not a random split of the same period.
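A temporal split is simple to state in code. The sketch below assumes each lead carries a `created` date (a hypothetical field name) and holds out everything on or after a cutoff as the future cohort.

```python
from datetime import date


def temporal_split(leads, cutoff):
    """Train on leads created before the cutoff, evaluate on those after it."""
    train = [lead for lead in leads if lead["created"] < cutoff]
    test = [lead for lead in leads if lead["created"] >= cutoff]
    return train, test


leads = [
    {"id": 1, "created": date(2024, 1, 10)},
    {"id": 2, "created": date(2024, 3, 5)},
    {"id": 3, "created": date(2024, 6, 1)},
]
train, test = temporal_split(leads, date(2024, 4, 1))
```

Unlike a random split, no test lead predates any training lead, so label-timing effects from the training period cannot leak into the evaluation.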

Challenge · 04

Sparse data for early-stage startups

~18% of inbound leads have no Clearbit coverage. LLM web extraction fills 73% of those gaps, but the remaining 27% degrade enrichment quality. A completeness score gate (minimum 5 core fields) flags low-coverage records before scoring and includes a data confidence label in the qualification card.
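The completeness gate could look roughly like this. Only the minimum of 5 core fields comes from the text; the field names in `CORE_FIELDS` and the `LOW`/`OK` labels are hypothetical.

```python
# Hypothetical core fields; the real list is not given in the logbook.
CORE_FIELDS = ("industry", "employee_count", "funding_stage", "country",
               "tech_stack", "website", "founded_year", "revenue_band")
MIN_CORE_FIELDS = 5  # minimum stated in the text


def completeness(record):
    """Count core fields that carry a usable value."""
    return sum(1 for f in CORE_FIELDS if record.get(f) not in (None, "", []))


def gate(record):
    """Flag low-coverage records before scoring and attach a confidence label."""
    n = completeness(record)
    low = n < MIN_CORE_FIELDS
    return {
        "core_fields_filled": n,
        "low_coverage": low,
        "data_confidence": "LOW" if low else "OK",
    }
```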

Engineering Depth


XGBoost binary classifier trained on CRM outcome data with a temporal train/test split. Features: 60% structured firmographic signals, 40% LLM-extracted semantic signals. SHAP TreeExplainer computes per-prediction feature contributions, which are displayed on AE qualification cards. The model is designed for monthly retraining on a rolling 18-month window; retraining is triggered automatically when feature drift metrics exceed a threshold.

Key numbers
1.8s · Avg inference latency
43 · Total input features
Built with: Lead Scoring · Feature Engineering · XGBoost · LLM Features · RevOps
ENTRY NO. 03 · EdTech / Professional Development · Live delivery · Adaptive Learning / RAG · LIVE

Nullfall

Identity · Adaptive learning engine

AI-assisted adaptive learning platform for ML/AI engineers — RAG-grounded explanations, competency progression, spaced repetition

Problem

Fixed course sequences delivered identically to every learner — no personalisation.

System Flow
7 stages · Placement Assessment -> Competency Progression · Adaptive Learning / RAG
Signature Module · Adaptive learning engine

Inspect adaptive engine

Adaptive Learning / RAG

Replay how learner state, competency gaps, and spaced repetition determine the next learning action.

Adaptive Path Trace · Prototype ● live
ASSESS CAT diagnostic · θ = 1.24 (above avg)
       Converged in 17 items · SE < 0.3
GAPS   ├─ Transformer attention ········ 0.42
       └─ Positional encoding ·········· 0.61
  ↓ RAG explanation · 768-dim · top-k 8
QUIZ   3 items · judge-validated · ~9% rejection
       Mastery check: ✗ attention gate
MICRO  Targeted microdrill · 40 XP · 4 min
SRS    Next review: +1d · ease factor 2.5

Business Impact

Outcomes
Before · Curriculum path: Fixed sequence, identical for all learners
After · Curriculum path: Personalised adaptive path per placement result

Learners receive paths matched to their demonstrated competency with RAG-grounded explanations and targeted microdrills when mastery checks fail. Explicit XP mechanics (500 XP/level, 40 XP/microdrill, 50 XP/quest) create visible progress milestones. The target metric is a 68% improvement in quiz completion rates vs a fixed-path baseline — not yet measured at scale.

68% · Quiz completion increase (target, not yet measured at scale)
500 XP · Per level milestone
50 XP · Per quest completed

Engineering Evaluation

Nightly eval
68% · Quiz Completion Rate Improvement (target)
84% · Assessment Reliability (Cronbach α)
91% · Quiz Generation Pass Rate
86% · RAG Explanation Grounding
Headline business result: 68% quiz completion increase (target)

Why This Is Hard

4 engineering challenges
Challenge · 01

Cold-start ability estimation

IRT CAT must converge on a reliable ability estimate in 15–25 items. Each item is selected to maximise Fisher information at the current θ estimate, making the diagnostic efficient. Learners who skip the diagnostic start at θ=0 (population mean) and the estimate updates rapidly from the first 5 interaction events.
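Maximum-information item selection can be sketched in a few lines. The example assumes a two-parameter logistic (2PL) model with discrimination `a` and difficulty `b`; the logbook does not state which IRT model is used, and the item bank layout is hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Item information under 2PL: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def next_item(theta, items, administered):
    """Pick the unused item that is most informative at the current estimate."""
    remaining = [it for it in items if it["id"] not in administered]
    return max(remaining, key=lambda it: fisher_information(theta, it["a"], it["b"]))


def standard_error(theta, answered_items):
    """SE of the ability estimate is 1 / sqrt(total information)."""
    info = sum(fisher_information(theta, it["a"], it["b"]) for it in answered_items)
    return 1.0 / math.sqrt(info) if info > 0 else float("inf")
```

Information peaks where item difficulty matches the current θ, so each selected item sits near the learner's estimated ability. The diagnostic can stop once `standard_error` falls below a target such as the SE < 0.3 shown in the trace.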

Challenge · 02

Knowledge graph sparsity in advanced subgraphs

Some advanced competency nodes have few explicitly defined prerequisite edges, creating sequencing ambiguity. Resolved by inferring implicit co-prerequisites from learner performance patterns across the cohort: nodes that are consistently failed together are treated as implicit prerequisites.
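A rough sketch of how "consistently failed together" might be measured. The co-failure rate (learners failing both nodes divided by learners failing either), the 0.8 rate, and the minimum support of 3 are all assumptions; the sketch also leaves the direction of the inferred edge undecided.

```python
from itertools import combinations


def co_failure_pairs(results, min_rate=0.8, min_support=3):
    """results: learner -> {node: passed}. Returns node pairs that fail together."""
    nodes = sorted({n for r in results.values() for n in r})
    pairs = []
    for a, b in combinations(nodes, 2):
        both = either = 0
        for r in results.values():
            if a in r and b in r:  # only learners assessed on both nodes
                failed_a, failed_b = not r[a], not r[b]
                both += int(failed_a and failed_b)
                either += int(failed_a or failed_b)
        if either >= min_support and both / either >= min_rate:
            pairs.append((a, b))
    return pairs
```

The support floor keeps a pair from being linked on the evidence of one or two learners, which matters most in exactly the sparse advanced subgraphs this technique targets.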

Challenge · 03

LLM quiz quality control

Initial GPT-4o quiz generation had a ~12% ambiguity or answer-leakage rate without a validation layer. A GPT-4o-mini judge with a structured rubric reduced this to ~9%. Persistent failures on the same objective fall back to a curated question pool.

Challenge · 04

XP calibration

The XP system required calibration so that one level (500 XP) corresponds to approximately 8–12 hours of engaged learning. Calibrated via content time estimates (microdrills ~4 min, quests ~25 min) and adjusted against a pilot cohort to match realistic learner pacing.

Engineering Depth


Course material is chunked into 1,200-character sentence-boundary units and embedded at 768 dimensions. At explanation time: the learner's detected misconception + learning objective is embedded, top-k 8 chunks retrieved from FAISS, passed to GPT-4o with a grounding requirement. The system prompt explicitly states that every claim must be supported by the retrieved context. Grounding rate: ~86% (RAGAS faithfulness, internal eval on 100 explanation samples — prototype benchmark).
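Sentence-boundary chunking with a 1,200-character ceiling might be implemented along these lines. The regex-based sentence splitter is a simplification and the function name is hypothetical; a sentence longer than the ceiling is kept whole rather than cut mid-sentence.

```python
import re


def chunk_text(text, max_chars=1200):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Breaking only at sentence boundaries keeps each retrieved chunk readable on its own, which helps when the generator is required to ground every claim in the retrieved text.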

Key numbers
768-dim · Embedding dimension
1,200 · Avg chunk size (chars)
Built with: Adaptive Learning · RAG · IRT · Spaced Repetition · Competency Systems
ENTRY NO. 04 · Sales Technology / GTM Engineering · Live delivery · Agentic Workflows / Browser Automation · LIVE

Scoutbound

Identity · Agentic workflow

Agentic prospecting workflow — website inspection, structured extraction, ICP scoring, and CRM-ready export

Problem

Manual tab-by-tab website inspection — no reproducible research methodology.

System Flow
7 stages · Company List Input -> CRM-Ready Export · Agentic Workflows / Browser Automation
Signature Module · Agentic workflow

Replay workflow

Agentic Workflows / Browser Automation

Step through one company workflow: browser inspection, extraction, scoring, reasoning, and CRM-ready export.

Workflow Trace · acme-corp.com ● live
INPUT  domain: acme-corp.com · pre-flight ✓
INSP   Homepage + About + Careers · 52s
       Full JS render · Playwright Chromium
EXTR   7/8 fields extracted · conf: HIGH
       Category: "Revenue operations SaaS"
       Signal: hiring RevOps engineer ★
  ↓ ICP rubric · 5 dimensions · 1.2s
SCORE  82 / 100 · STRONG FIT
OUT    → HubSpot · 91% completeness · 2m 14s total

Business Impact

Outcomes
Before · Research method: Manual tab-by-tab web inspection
After · Research method: Playwright browser agent per company

The full prospect research workflow — from company list to enriched, scored, CRM-ready profiles — runs automatically. Key metrics are prototype benchmarks from controlled evaluation: 4× speed improvement, 83% extraction consistency, 71% lead relevance precision, 91% CRM field completeness. These are not yet measured against tracked production outcomes.

4× · Faster prospect research
91% · CRM field completeness
83% · Extraction consistency

Engineering Evaluation

Nightly eval
83% · Extraction Consistency
71% · Lead Relevance Precision
91% · CRM Handoff Completeness
94% · Workflow Completion Rate
Headline business result: 4× faster prospect research

Why This Is Hard

4 engineering challenges
Challenge · 01

JavaScript-heavy SPAs break naive HTTP scraping

Playwright's full Chromium rendering captures JavaScript-executed content. Pages relying on authentication walls or heavy AJAX are flagged as partial-extraction, with a manual-review recommendation included in the export.

Challenge · 02

Extraction schema drift

Company websites have wildly different HTML structures. LLM-based extraction handles structural variation, but confidence calibration is critical: fields with LLM confidence below threshold are flagged in the review_needed column rather than silently included in the CRM export.
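A small sketch of the flagging rule. The `review_needed` column name comes from the text; the 0.7 threshold, the `(value, confidence)` input shape, and the semicolon-joined format are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.7  # hypothetical; the actual threshold is not stated


def build_export_row(domain, extracted):
    """extracted: field -> (value, confidence). Low-confidence fields are flagged."""
    row = {"domain": domain}
    flagged = []
    for field, (value, confidence) in extracted.items():
        row[field] = value
        if value is None or confidence < CONFIDENCE_THRESHOLD:
            flagged.append(field)
    # One column listing every field a human should check before trusting the row.
    row["review_needed"] = ";".join(sorted(flagged))
    return row
```

Keeping the value in the row while naming it in `review_needed` lets a reviewer confirm or correct it in place instead of re-researching the company.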

Challenge · 03

Workflow timeout and partial failure handling

A 90-second per-company timeout prevents runaway browser sessions. Partial results (some pages inspected, not all) are saved per company; the export includes an extraction completeness score per record so users know what was captured vs what needs manual review.

Challenge · 04

ICP scoring consistency across a batch

GPT-4o-mini scoring was inconsistent on similar companies when prompted per-company. Stabilised by generating the scoring rubric once at workflow start (not per company) and applying it identically across all companies in the batch.

Engineering Depth


Playwright runs in async mode with configurable concurrency (default: 3 parallel browser contexts per run). Each context handles one company: homepage + About + up to 2 additional pages. A Redis-backed task queue manages company dispatch. Per-page timeout: 30s. Per-company timeout: 90s. Failed pages retry once with a longer wait (45s). Partial results are preserved for companies where at least one page succeeded.
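The concurrency and timeout rules above can be sketched with `asyncio` alone. Here `fetch_page` is a hypothetical coroutine standing in for a Playwright page load, and the Redis-backed queue is replaced by a plain dictionary of companies, so this shows only the semaphore, retry, and partial-result logic.

```python
import asyncio


async def fetch_with_retry(fetch_page, url, page_timeout=30.0, retry_timeout=45.0):
    """One attempt at the normal timeout, then one retry with the longer wait."""
    for timeout in (page_timeout, retry_timeout):
        try:
            return await asyncio.wait_for(fetch_page(url), timeout)
        except Exception:          # timeout or page error: move on to the retry
            continue
    return None                    # both attempts failed


async def run_batch(fetch_page, companies, concurrency=3, page_timeout=30.0,
                    retry_timeout=45.0, company_timeout=90.0):
    """companies: {domain: [path, ...]}. Returns {domain: {path: content}}."""
    sem = asyncio.Semaphore(concurrency)   # caps companies processed in parallel
    output = {}

    async def inspect(domain, paths, results):
        for path in paths:
            content = await fetch_with_retry(
                fetch_page, f"https://{domain}{path}", page_timeout, retry_timeout)
            if content is not None:
                results[path] = content    # stored at once so a later timeout keeps it

    async def worker(domain, paths):
        results = {}
        async with sem:
            try:
                await asyncio.wait_for(inspect(domain, paths, results), company_timeout)
            except asyncio.TimeoutError:
                pass                       # keep whatever pages already succeeded
        if results:                        # at least one page succeeded
            output[domain] = results

    await asyncio.gather(*(worker(d, p) for d, p in companies.items()))
    return output


async def fake_fetch(url):
    """Stand-in for a real browser page load; '/slow' simulates a hanging page."""
    if url.endswith("/slow"):
        await asyncio.sleep(1.0)
    return f"html:{url}"
```

Writing each page into `results` as soon as it succeeds is what makes partial results survive: when the per-company timeout cancels the remaining work, the pages already captured are still exported.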

Key numbers
2m 14s · Avg workflow completion
83% · Extraction consistency
Built with: Agentic Workflows · Browser Automation · Structured Extraction · ICP Scoring · CRM
ENTRY NO. 05 · STATUS: UNWRITTEN · YOUR SYSTEM

Your system, next.

We take on a small number of engagements per quarter. Bring the problem and the success metric; we will bring the engineering.