In Progress (Targeting ICML 2026)
Hierarchical RL User Simulation Improves Alignment by 36%
Khatua, A.*, Wu, S.*, Choi, E.*, Wang, H., He-Yueya, J., Weerasooriya, C., Wei, W., Yang, D., Leskovec, J., & Zou, J.*
HumanLM introduces hierarchical RL modules for user simulation, with initial results showing improved alignment with real user responses. The system raises LLM-judge similarity by 36% over SFT baselines through multi-level generation training and carefully designed stance and style reward functions.
Paper:
HumanLM: Building Digital Humans from Large Language Models
In Progress (Targeting ICML 2026)
Multilingual SWE-smith Scales Bug Generation Across Programming Languages
Khatua, A.*, Li, X.*, Shethia, P.*, Li, Z.*, & Yang, J.
Extended SWE-smith by automating test-environment construction for JavaScript/TypeScript, Java, Rust, and C++. The system scales procedural bug generation to any repository, producing mid-training data that improves Multi-SWE-Bench performance across multiple programming languages.
Paper:
Multilingual SWE-smith
EMNLP 2025
Study Reveals 3.3% of Wikipedia Facts Contradict Each Other
Semnani, S. J., Burapacheep, J., Khatua, A., Atchariyachanvanit, T., Wang, Z., & Lam, M. S.
Stanford researchers have discovered that at least 3.3 percent of English Wikipedia facts contradict other information in the corpus. This amounts to millions of inconsistent statements across the encyclopedia. The team introduced CLAIRE, an AI system that helps Wikipedia editors identify 64.7 percent more inconsistencies than traditional methods.
Paper:
Detecting Corpus-Level Knowledge Inconsistencies in Wikipedia with Large Language Models
NeurIPS 2025 Workshop (Oral)
Video Understanding System Achieves 3% Improvement Without Architecture Changes
Durante, Z., Singh, S., Khatua, A., Agarwal, S., Tan, R., Lee, Y. J., Gao, J., Adeli, E., & Fei-Fei, L.
A data-centric approach for efficient video understanding achieved a 3 percent absolute improvement on the VideoMME benchmark under identical compute constraints, without requiring architectural modifications. The method splices short captioned videos into synthetic long-context training samples.
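The splicing idea can be sketched in a few lines. This is a minimal illustration with made-up data structures (the `Clip` type, frame placeholders, and segment markers are assumptions for clarity), not the authors' implementation:

```python
# Hypothetical sketch: concatenate short captioned clips into one
# synthetic long-context training sample.
from dataclasses import dataclass

@dataclass
class Clip:
    frames: list[str]   # stand-in for frame tensors
    caption: str

def weave(clips: list[Clip]) -> dict:
    """Splice short clips into a single long sample whose target text
    interleaves the per-clip captions with segment markers."""
    frames, segments = [], []
    for i, clip in enumerate(clips):
        frames.extend(clip.frames)
        segments.append(f"[segment {i}] {clip.caption}")
    return {"frames": frames, "caption": " ".join(segments)}

sample = weave([
    Clip(["f0", "f1"], "a dog runs"),
    Clip(["f2"], "the dog jumps"),
])
print(len(sample["frames"]), sample["caption"])
```

The payoff is that long-context supervision is manufactured from abundant short captioned clips rather than scarce long annotated videos.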
Paper:
VideoWeave: A Data-Centric Approach for Efficient Video Understanding
Panasonic
Multi-Agent Video QA System Improves Zero-Shot Performance by 6%
Kugo, N., Li, X., Li, Z., Gupta, A., Khatua, A., Jain, N., Patel, C., Kyuragi, Y., Ishii, Y., Tanabiki, M., Kozuka, K., & Adeli, E.
A multi-agent framework for video question answering with role specialization improved zero-shot performance by up to 6 percent over prior state-of-the-art methods through collaborative reasoning and information aggregation.
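The aggregation step of such a framework can be sketched as follows. The specific agent roles and the majority-vote scheme here are illustrative assumptions, not the paper's actual design:

```python
# Hedged sketch of role-specialized agents for video QA: each agent
# answers from a different information source, and a simple aggregator
# combines their answers by majority vote (illustrative only).
from collections import Counter

def aggregate(answers: list[str]) -> str:
    """Majority vote over agents' answers; ties broken by first occurrence."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical specialized agents, stubbed out with fixed answers.
agents = {
    "caption_agent": lambda q: "a cat",   # reads video captions
    "frame_agent":   lambda q: "a cat",   # inspects sampled frames
    "object_agent":  lambda q: "a dog",   # queries an object detector
}

question = "What animal appears in the video?"
final = aggregate([agent(question) for agent in agents.values()])
print(final)  # majority answer: "a cat"
```

Role specialization lets each agent focus on one evidence channel, while the aggregator reconciles disagreements across them.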
Paper:
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering
KDD 2023
Largest Academic Graph Dataset Created: 162× Larger Than Prior Benchmarks
Khatua, A., Mailthody, V., Taleka, B., Song, X., Ma, T., Bigaj, P. & Hwu, W.
Researchers created the Illinois Graph Benchmark (IGB), the largest academic GNN dataset with 260 million nodes, 4 billion edges, and 220 million labels. The dataset is 162 times larger than prior datasets and has been adopted as an MLPerf industry standard for GNN benchmarking.
Paper:
IGB: Addressing The Gaps In Labeling, Features, Heterogeneity, and Size of Public Graph Datasets for Deep Learning Research
CLASSIFIED AD
WANTED
More GPU hours. Will trade sanity. Contact: desperate@stanford.edu
Seeking: A100s, H100s, or any GPU that doesn't crash during training. Willing to negotiate: firstborn child, coffee supply, or eternal gratitude.
Current situation: Running experiments on a potato. Results may vary. Desperation level: Critical.
References available upon request. Previous GPU owners: please don't ask.
—·—· ·—·· ·· —·—· —·— / ——· ·——· ··— / ·— —··
Click for Details