Kshitij Mishra
AI Quality Analyst | LLM Evaluation | Data Analyst | Prompt Engineer
AI Quality Analyst at Tavus AI, evaluating production LLM and voice AI systems. Reduced hallucination rates by ~30%, optimized 10+ retrieval workflows, and cut review turnaround by 20%.
I make AI systems more reliable.
I'm an AI Quality Analyst based in Noida, India. My day-to-day at Tavus AI involves evaluating production LLM and voice AI outputs, building structured feedback loops, and designing test frameworks that catch problems before they reach users.
Proficient in Python and SQL for data analysis, fluent in LLM evaluation methodology, and experienced with voice AI tooling including Whisper, Google STT, and ElevenLabs. Outside core work, I run a parallel track in quantitative crypto market analysis, studying volatility regimes and market microstructure.
I also maintain kshitij.info, a personal portfolio and engineering blog where I regularly publish technical articles on topics ranging from Kalman filters to crypto microstructure and data pipeline design.
Seeking roles in AI/ML engineering, data analysis, QA testing, or prompt engineering where structured evaluation thinking matters.
Work history
- Evaluated 500+ AI-generated voice and video outputs per month, maintaining production quality standards across live workflows.
- Reduced model hallucination rate by ~30% by designing structured feedback reports delivered to engineering and research teams.
- Built and optimized LLM prompt test suites, improving output consistency and correctness across 10+ retrieval workflows.
- Developed scoring rubrics and benchmarking criteria to compare prompt variants and track output degradation over time.
- Supported deployment readiness of 5+ AI features by validating performance across real-world multi-scenario test cases.
- Streamlined evaluation pipelines alongside cross-functional teams, cutting review turnaround time by 20%.
- Analyzed customer behavior and partner performance data to identify service gaps, improving efficiency by 15%.
- Maintained structured weekly performance reports used by operations leadership for data-driven decision-making.
- Optimized the partner onboarding workflow using data insights, reducing average onboarding time by 10%.
Things I've built
AI Voice & Output Evaluation Framework
Designed a production-style evaluation system for LLM and voice AI outputs with structured scoring for hallucination detection, logical consistency, and response quality. Built reusable evaluation pipelines and scoring frameworks to improve output reliability and consistency across testing workflows.
Crypto Market Analysis System
Built a Streamlit-based crypto analytics system to track volatility regimes, liquidity conditions, and trend structure using time-series analysis and Python data pipelines. Added statistical indicators and visualization layers to turn raw market data into usable insights.
User Failure & Retention Dashboard
Built a data-driven dashboard to analyze user failure patterns and their impact on retention in AI-based workflows. Tracked retry cycles, drop-off points, and cohort retention across 500+ users to identify the key bottlenecks affecting user success and experience.
Engineering log
Open to new opportunities.
Looking for roles in AI/ML engineering, data analysis, QA testing, or prompt engineering. Based in Noida, India. I respond to every message.