Sangdoo Yun

I'm a research director at Naver AI Lab, working on robust and efficient machine learning models for computer vision, language, and multi-modal tasks, with a focus on real-world applications.

At Naver, I've worked on network architectures (ReXNet, PiT), training techniques (CutMix, ReLabel, AdamP, KD), and robustness (ReBias, Shortcut learning, Model Stock). I've also contributed to Naver's OCR (e.g., CRAFT, STR, Donut), face recognition, and LLM (e.g., Cream) products.

I received my MS and PhD in computer vision from Seoul National University in 2013 and 2017, respectively, under the supervision of Prof. Jin Young Choi. I received my BS from Seoul National University in 2010.

I have also been an adjunct professor at SNU AI Inst. since Sep 2022, continuing my previous position at SNU CSE Dept (Sep 2021 - Aug 2022). My previous lectures at SNU are available at [Spring 2022] and [Fall 2024].

Email / CV / Google Scholar / Github

News

Apr 2025. I will serve as an Area Chair for NeurIPS 2025.
Jan 2025. One paper on Scaling-up Membership Inference Attacks is accepted at NAACL 2025.
Jan 2025. One paper on Masking Augmentation for Supervised Learning is accepted at CVPR 2025.
Jan 2025. Two papers are accepted at ICLR 2025: Probabilistic CLIP and Dynamic Weight Interpolation.
Jan 2025. I will serve as an Area Chair for ICLR 2025 and CVPR 2025.
Dec 2024. I will co-organize a workshop on Video-Language Models at NeurIPS 2024.

Research

I am interested in training robust, generalizable, and transferable ML models (including vision, language, and multi-modal models) for real-world applications.

	Scaling Up Membership Inference: When and How Attacks Succeed on Large Language Models Haritz Puerto, Martin Gubri, Sangdoo Yun, Seong Joon Oh. NAACL Findings, 2025 arXiv / Code We show that membership inference attacks (MIA) can work on LLMs, but only when applied at scale, specifically at the document level or more.
	Masking meets Supervision: A Strong Learning Alliance Byeongho Heo, Taekyung Kim, Sangdoo Yun, Dongyoon Han. CVPR, 2025 arXiv / Code Masking augmentation (e.g., dropping out 80% of image tokens) has proven effective for self-supervised learning (i.e., MAE), but not for supervised learning, where it makes training unstable. We introduce MaskSub, which integrates masking augmentation into supervised learning by introducing a sub-branch that handles the signals from the masked images.
	Probabilistic Language-Image Pre-Training Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun. ICLR, 2025 arXiv We propose ProLIP, a Probabilistic Language-Image Pre-Training method.
	DaWin: Training-free Dynamic Weight Interpolation for Robust Adaptation Changdae Oh, Yixuan Li, Kyungwoo Song, Sangdoo Yun*, Dongyoon Han. ICLR, 2025 arXiv / Code Weight averaging/interpolation works well for robust adaptation by combining multiple models' expertise. In this work, we propose a dynamic weight interpolation that adaptively computes per-sample weights using model entropy, without additional training overhead.
	Code-Switching Curriculum Learning for Multilingual Transfer in LLMs Haneul Yoo, Cheonbok Park, Sangdoo Yun, Alice Oh, Hwaran Lee. arXiv, 2024 arXiv Humans mix languages (i.e., code-switching) naturally when learning second languages; inspired by this, we teach LLMs new languages through code-switching data. Our Code-Switching Curriculum Learning (CSCL) boosts cross-lingual skills efficiently, even when training data is scarce.
	A Unified Framework for Motion Reasoning and Generation in Human Interaction Jeongeun Park, Sungjoon Choi, Sangdoo Yun*. arXiv, 2024 arXiv / Project page We introduce VIM, a unified model capable of understanding, generating, and controlling interactive human motions through multi-turn conversational contexts. We also create Inter-MT2, a new large-scale dataset containing diverse, interactive motion instructions, enabling VIM to effectively perform versatile tasks.
	Direct Unlearning Optimization for Robust and Safe Text-to-Image Models Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee. NeurIPS, 2024 arXiv / Code Text-to-image models are good, but sometimes they are good at generating harmful images. We introduce DUO, a clever method that safely teaches these models to forget inappropriate content without spoiling their creativity.
	Model Stock: All we need is just a few fine-tuned models Dong-Hwan Jang, Sangdoo Yun, Dongyoon Han. ECCV, 2024 (Oral Presentation) arXiv / Bibtex / Code Averaging many models (e.g., 70 models for Model soups) improves in-distribution and out-of-distribution performance. Our method, Model Stock (an analogy for a very efficient version of Model soups), can achieve comparable performance with just a few fine-tuned models. Our secret is in the weight space of fine-tuned models. We find that the models have special geometric characteristics and we derive an efficient way to achieve the effect of averaging many models.
	HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, Sangdoo Yun. ECCV, 2024 (Oral Presentation) arXiv / Bibtex / Code Data quality, especially specificity and clarity, is crucial for self-supervised model training. We introduce Hyperbolic Entailment Filtering (HYPE) for images and texts, with state-of-the-art performance in the DataComp benchmark.
	Rotary Position Embedding for Vision Transformer Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun. ECCV, 2024 arXiv / Bibtex / Code We study the impact of Rotary Position Embedding (RoPE) on vision models, especially Vision Transformers (ViTs). Starting from the original 1D RoPE for text data, we search for a practical implementation for 2D vision data. Our analysis shows that RoPE achieves remarkable extrapolation performance (e.g., when increasing image resolution).
	TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification Martin Gubri, Dennis Thomas Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh. ACL Findings, 2024 arXiv / Bibtex / Code LLM weights are valuable assets, and their usage should be strictly controlled under rules given by the providers, for both open- and closed-source models. For instance, when a new model is released, the provider wants to know whether the model is based on their own LLMs or not. To this end, we propose TRAP, a black-box LLM identification method. TRAP asks the model a very specific question (e.g., give a random 4-digit number) that only one company's model will answer in a certain way.
	Calibrating Large Language Models Using Their Generations Only Dennis Thomas Ulmer, Martin Gubri, Hwaran Lee, Sangdoo Yun, Seong Joon Oh. ACL, 2024 arXiv / Bibtex / Code Our goal is to calibrate the confidence of LLMs to increase the reliability of their outputs. We propose APRICOT, a method to calibrate the confidence of LLMs using an external lightweight model.
	TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, Hwaran Lee, Gunhee Kim. ACL Findings, 2024 arXiv / Bibtex / Code When interacting with an LLM-based chatbot, especially in role-playing scenarios, users expect the model to act as a specific character at a specific time point. We propose TimeChara, a method to evaluate the point-in-time character hallucination of role-playing large language models. TimeChara can evaluate the model's consistency and coherence in the role-playing task.
	Who Wrote this Code? Watermarking for Code Generation Taehyun Lee, Seokhee Hong, Jaewoo Ahn, Ilgee Hong, Hwaran Lee, Sangdoo Yun, Jamin Shin, Gunhee Kim. ACL, 2024 arXiv / Bibtex / Code We propose SWEET, a watermarking method for code generation LLMs. Previous watermarking methods fail to handle code generation scenarios due to the syntactic and semantic characteristics of code. SWEET leverages the entropy of the token distribution and shows strong watermarking performance in code generation.
	Toward Interactive Regional Understanding in Vision-Large Language Models Jungbeom Lee, Sanghyuk Chun, Sangdoo Yun*. NAACL, 2024 arXiv / Bibtex / Code We propose RegionVLM, a vision-large language model that can interactively attend to regions of interest in images. RegionVLM simply handles the regional information (e.g., coordinates) as additional text inputs. Trained with the Localized Narratives dataset, RegionVLM shows improved performance in regional understanding tasks.
	Language-only Efficient Training of Zero-shot Composed Image Retrieval Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun. CVPR, 2024 arXiv / Bibtex / Code / Demo Do we really need images for training composed image retrieval (CIR) models? Our answer is NO. We propose LinCIR, a language-only training method for CIR. LinCIR shows strong performance with efficient training cost (e.g., training LinCIR with a CLIP ViT-G backbone in 1 hour).
	Compressed Context Memory for Online Language Model Interaction Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song. ICLR, 2024 arXiv / Bibtex / Code How can we use infinitely long contexts for LLMs? Inspired by the gist token, we propose an online context compression method. Our method can compress accumulated attention KVs into a few [comp] tokens with 5x smaller context memory size.
	Prometheus: Inducing Fine-grained Evaluation Capability in Language Models Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, Minjoon Seo. ICLR, 2024 arXiv / Bibtex / Code We introduce Prometheus, a fully open-source LLM with evaluation performance comparable to GPT-4. We built the Feedback Collection dataset, which is also open-sourced, including more than 20K instructions and 100K responses.
	Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models Geewook Kim, Hodong Lee, Daehee Kim, Haeji Jung, Sanghee Park, Yoonsik Kim, Sangdoo Yun, Taeho Kil, Bado Lee, Seunghyun Park. EMNLP, 2023 arXiv / Bibtex / Code / Demo After introducing Donut, we build Cream🍦, which leverages large language models (LLMs). To mitigate the gap between vision encoders and LMs, we propose auxiliary encoders and a contrastive learning scheme. Cream demonstrates robust and impressive document understanding performance.
	ProPILE: Probing Privacy Leakage in Large Language Models Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, Seong Joon Oh. NeurIPS, 2023 (Spotlight) arXiv / Bibtex / tweet / Code / Demo Large language models (LLMs) can answer just about anything, thanks to their hyper-scale parameters and data. However, they may also reveal your private information (i.e., personally identifiable information (PII)), which could be problematic. With our probing tool, ProPILE, we can investigate whether a model reveals our personal information.
	Neural Relation Graph for Identifying Problematic Data Jang-Hyun Kim, Sangdoo Yun, Hyun Oh Song. NeurIPS, 2023 arXiv / Bibtex / Code Problematic data (e.g., outlier data or incorrect labels) harm model performance and robustness. However, identifying such problematic data in large-scale datasets is quite challenging. Our solution focuses on the relationship among data, particularly in the feature space. By utilizing our relation graph, we can easily determine whether a data point is an outlier, has a misassigned label, or is perfectly fine.
	CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun. Equal contribution TMLR, 2024 arXiv / Bibtex / Code / Demo We propose a diffusion-based model, CompoDiff, for the Composed Image Retrieval (CIR) task. To train the model, we created a new dataset comprising 18 million triplets of images and associated conditions. CompoDiff shows state-of-the-art zero-shot CIR performance.
	Neglected Free Lunch -- Learning Image Classifiers Using Annotation Byproducts Dongyoon Han, Junsuk Choe, Dante Chun, John Joon Young Chung, Minsuk Chang, Sangdoo Yun, Jean Y. Song, Seong Joon Oh. Equal contribution ICCV, 2023 arXiv / cvf / Bibtex / Code & Dataset / Video / slide (@KCCV2024) When annotating data, annotators unintentionally generate auxiliary information during the annotation task, such as mouse traces, mouse clicks, and time durations. We call them annotation byproducts (AB). We propose the new paradigm of learning using annotation byproducts (LUAB), which can enhance the robustness of image classifiers by aligning them with human recognition mechanisms.
	SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage Song Park, Sanghyuk Chun, Byeongho Heo, Wonjae Kim, Sangdoo Yun. Equal contribution ICCV, 2023 arXiv / cvf / Bibtex / Code Vision deep models are image data hungry, but image storage has become a bottleneck (e.g., LAION-5B images require 240 TB). We propose a storage-efficient training method, SeiT, that utilizes only 1% of standard pixel storage without sacrificing accuracy.
	MPChat: Towards Multimodal Persona-Grounded Conversation Jaewoo Ahn, Yeda Song, Sangdoo Yun, Gunhee Kim. ACL, 2023 arXiv / Bibtex / Code Building a persona is crucial for personalized dialog systems. We explore an additional vision modality beyond text-based personas. To this end, we collect a multimodal persona dialog dataset (MPChat) and demonstrate how the vision modality helps the conversation.
	What Do Self-Supervised Vision Transformers Learn? Namuk Park, Wonjae Kim, Byeongho Heo, Taekyung Kim, Sangdoo Yun. ICLR, 2023 OpenReview / Poster / Slide / arXiv / Bibtex / Code What are the differences between contrastive learning (CL) and masked image modeling (MIM)? Our findings indicate that: (1) CL captures global patterns more effectively than MIM, (2) CL learns shape-oriented features while MIM focuses on texture-oriented features, and (3) CL plays a key role in later layers, whereas MIM is more concentrated on early layers.
	Exploring Temporally Dynamic Data Augmentation for Video Recognition Taeoh Kim, Jinhyung Kim, Minho Shim, Sangdoo Yun, Myunggu Kang, Dongyoon Wee, Sangyoun Lee. ICLR, 2023 (Notable Top 25%) OpenReview / arXiv / Bibtex We introduce DynaAugment, a new video data augmentation to capture the temporal dynamics in videos. DynaAugment changes the magnitude of augmentation operation over time to emulate temporal dynamics found in real-world videos.
	A Unified Analysis of Mixed Sample Data Augmentation: A Loss Function Perspective Chanwoo Park, Sangdoo Yun, Sanghyuk Chun. Equal contribution NeurIPS, 2022 OpenReview / arXiv / Poster / Bibtex / Code Mixed sample data augmentation (MSDA), such as mixup and CutMix, has become a de facto strategy, but it has not yet been deeply understood. We introduce the first unified theoretical analysis for MSDAs and identify the difference between mixup and CutMix. Based on the analysis, we build a simple hybrid version of mixup and CutMix that leverages the advantages of both.
	Donut 🍩: Document Understanding Transformer without OCR Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park. ECCV, 2022 arXiv / Bibtex / Code Current visual document understanding (VDU) models rely heavily on external OCR frameworks (e.g., text detection, text recognition). OCR is expensive and sometimes not available. We bravely remove the dependency on OCR by modeling a simple transformer architecture. Take our highly efficient and powerful VDU model, Donut 🍩!
	Dataset Condensation via Efficient Synthetic-Data Parameterization Jang-Hyun Kim, Jinuk Kim, Seong Joon Oh, Sangdoo Yun, Hwanjun Song, Joonhyun Jeong, Jung-Woo Ha, Hyun Oh Song. ICML, 2022 Bibtex / Code Data condensation is a trick to compress training data by synthesizing them into several images. The goal is to obtain higher performance with lower consumption of data storage. We propose practical tricks for data condensation to bring it into more practical real-world settings (e.g., 224x224 size with ImageNet) beyond previous toy-ish settings (e.g., 32x32 size with CIFARs).
	Dataset Condensation with Contrastive Signals Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, Sungroh Yoon. ICML, 2022 Bibtex Existing data condensation methods deal with class-wise gradients and ignore the inter-class information. We show it would degrade performance in practical scenarios like fine-grained classification. Our simple remedy is modifying the loss function to integrate contrastive signals, which shows effectiveness in several practical scenarios.
	Weakly Supervised Semantic Segmentation using Out-of-Distribution Data Jungbeom Lee, Seong Joon Oh, Sangdoo Yun, Junsuk Choe, Eunji Kim, Sungroh Yoon. CVPR, 2022 Bibtex / Code Weakly supervised semantic segmentation (WSSS) suffers from spurious correlations between foreground (e.g., train) and background (e.g., rail). Our idea is to collect background images without any foreground pixels (e.g., collecting railroad images without trains). Then we teach the model not to rely on the background pixels when classifying foreground classes. Adding a small amount of background images brings a large performance gain in WSSS.
	The Majority Can Help The Minority: Context-rich Minority Oversampling for Long-tailed Classification Seulki Park, Youngkyu Hong, Byeongho Heo, Sangdoo Yun, Jin Young Choi. CVPR, 2022 Bibtex / Code Data oversampling is a simple solution for long-tailed classification, but it may exacerbate overfitting with limited context information. Motivated by CutMix, we introduce a simple context-rich oversampling method. Interestingly, majority classes play a key role in boosting the classification accuracy of minority classes!
	Hypergraph-Induced Semantic Tuplet Loss for Deep Metric Learning Jongin Lim, Sangdoo Yun, Seulki Park, Jin Young Choi. CVPR, 2022 Bibtex / Code We formulate deep metric learning as a hypergraph node classification problem to capture multilateral relationships through semantic tuplets, going beyond previous pairwise relationship-based methods.
	Which Shortcut Cues Will DNNs Choose? A Study from the Parameter-Space Perspective Luca Scimeca, Seong Joon Oh, Sanghyuk Chun, Michael Poli, Sangdoo Yun. Equal contribution ICLR, 2022 Bibtex / OpenReview What causes the shortcut learning problem? We observe the model's behavior when multiple cues (e.g., color and shape) are equally likely to be fit. Interestingly, the model prefers to fit a certain cue (e.g., color rather than shape) even in such a balanced situation. This paper explains the reason from the parameter-space perspective.
	Detecting and Removing Text in the Wild Junho Cho, Sangdoo Yun, Dongyoon Han, Byeongho Heo, Jin Young Choi. IEEE Access, 2021 Bibtex A unified text detection and text removal framework for scene text removal in the wild.
	Rethinking Spatial Dimensions of Vision Transformers Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh. ICCV, 2021 Bibtex / Code The Vision Transformer (ViT) has become a strong design principle for vision modeling. Because ViT originated from NLP's Transformer, it has no intermediate pooling layers, which are common in CNNs. We simply inject the pooling concept into ViT and introduce a new architecture, PiT.
	Normalization Matters in Weakly Supervised Object Localization Jeesoo Kim, Junsuk Choe, Sangdoo Yun, Nojun Kwak. ICCV, 2021 Bibtex / Code We investigate the effect of CAM (CVPR'16) normalization on WSOL and suggest a new normalization method.
	Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, Sanghyuk Chun. CVPR, 2021 Bibtex / Code / Video / Poster ImageNet has substantial label noise, and there have been efforts to fix it on the evaluation set (e.g., Shankar et al., Beyer et al.). We turn our attention to the training set, whose label noise has been overlooked, and release the re-labeled ImageNet and codebase (published at this repo). The re-labeled data improves accuracy on ImageNet and downstream tasks.
	Rethinking Channel Dimensions for Efficient Model Design Dongyoon Han, Sangdoo Yun, Byeongho Heo, Youngjoon Yoo. CVPR, 2021 Bibtex / Code CNN architectures (e.g., ResNet, MobileNet, etc.) usually follow the same feature-map down-sampling policy. We conjecture that such a design policy would harm the representation ability of intermediate layers. We analyze the feature-map's rank (inspired by softmax-bottleneck) and suggest a new network architecture, namely, Rank eXpanded Network (ReXNet).
	AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Youngjung Uh, Sangdoo Yun, Jung-Woo Ha. Equal contribution ICLR, 2021 Bibtex / Code / Project We add a projection operation to the Adam and SGD optimizers to mitigate the slowdown of convergence caused by rapidly increasing weight norms. It leads to performance improvements across the board with easy installation (pip install adamp).
	VideoMix: Rethinking Data Augmentation for Video Classification Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Jinhyung Kim. arXiv, 2020 Bibtex Extension of CutMix to video recognition. We search for the best mixing strategy for video tasks.
	Learning De-biased Representations with Biased Representations Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, Seong Joon Oh. ICML, 2020 Bibtex / Code / ICML Virtual / Youtube Models tend to learn biased representations. To "de-bias" model representation, we "minus" biased representation from the target model.
	EXTD: Extremely tiny face detector via iterative filter reuse Youngjoon Yoo, Dongyoon Han, Sangdoo Yun. arXiv, 2019 Bibtex / Code Face detectors typically use multiple stages to handle multiple resolutions, but they do not actually require such complex feature encoding. We introduce an extremely tiny face detector via iterative filter reuse.
	CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, Youngjoon Yoo. ICCV, 2019 (Oral Presentation) Bibtex / Code / Talk / Poster / Blog A simple cut-and-paste strategy brings significant performance boosts across tasks and datasets.
	What Is Wrong with Scene Text Recognition Model Comparisons? Dataset and Model Analysis Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, Hwalsuk Lee. ICCV, 2019 (Oral Presentation) Bibtex / Code Scene text recognition evaluation has been somewhat wrong because the model and dataset were not controlled. We provide a unified benchmark protocol and fairly reproduced results. We also found a new architecture from those unified experiments.
	A Comprehensive Overhaul of Feature Distillation Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, Jin Young Choi. ICCV, 2019 Bibtex / Code There are lots of options for feature distillation: loss function, distillation position, teacher/student transforms. We study all the possible methods and provide a comprehensive overhaul of feature distillation. Through this, we found the best feature distillation method, which even beats the teacher's accuracy.
	An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods Sanghyuk Chun, Seong Joon Oh, Sangdoo Yun, Dongyoon Han, Junsuk Choe, Youngjoon Yoo. ICML Workshop, 2019 Bibtex We provide structured experimental results for the effectiveness of regularization methods on robustness and uncertainty benchmarks.
	Character Region Awareness for Text Detection Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, Hwalsuk Lee. CVPR, 2019 Bibtex / Code Text detectors often fail to detect real-world scene texts, e.g., curved or long texts. We propose a two-stage approach: first detect individual characters, then connect them. We also introduce a semi-weakly-supervised training trick to boost our detector's performance.
	Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons Byeongho Heo, Minsik Lee, Sangdoo Yun, Jin Young Choi. AAAI, 2019 (Oral Presentation) Bibtex / Code Previous feature distillation approaches (e.g., FitNet) focus on mimicking the teacher's feature values. Rather, our goal is to transfer the actual "activation boundary" by assigning binary labels (i.e., activated or not) to all the neurons. Our loss minimizes the mismatch between the teacher's and the student's binary labels. It outperforms state-of-the-art KD methods.
	Knowledge Distillation with Adversarial Samples Supporting Decision Boundary Byeongho Heo, Minsik Lee, Sangdoo Yun, Jin Young Choi. AAAI, 2019 Bibtex / Code To find the teacher network's decision boundary more precisely, we adopt an adversarial attack technique. We show that the attacked samples improve distillation performance.
	Unsupervised Holistic Image Generation from Key Local Patches Donghoon Lee, Sangdoo Yun, Sungjoon Choi, Hwiyeon Yoo, Ming-Hsuan Yang, Songhwai Oh. ECCV, 2018 Bibtex / Code We train a GAN model that generates a holistic image from its small parts.
	Context-aware Deep Feature Compression for High-speed Visual Tracking Jongwon Choi, Hyung Jin Chang, Tobias Fischer, Sangdoo Yun, Kyuewang Lee, Jiyeoup Jeong, Yiannis Demiris, Jin Young Choi. CVPR, 2018 Bibtex / Code Correlation-based trackers have shown promising performance using hand-crafted features (e.g., HOG). When adopting deep features for correlation-based trackers, the bottleneck is the computing costs for CNN feature extraction. We propose a deep feature compression method for high-speed and high-accuracy visual tracker.
	Action-Driven Visual Object Tracking with Deep Reinforcement Learning Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, Jin Young Choi. TNNLS, 2018 Bibtex / Code A journal extension of ADNet (CVPR'17).
	Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, Jin Young Choi. CVPR, 2017 (Spotlight Presentation) Bibtex / Code We formulate visual tracking as a decision-making process and propose a reinforcement learning method to train visual trackers. Our RL-based tracker shows state-of-the-art level performance and is especially efficient in the semi-supervised scenario.
	Variational Autoencoded Regression: High Dimensional Regression of Visual Data on Complex Manifold Youngjoon Yoo, Sangdoo Yun, Hyung Jin Chang, Yiannis Demiris, Jin Young Choi. CVPR, 2017 Bibtex / Code Generating visual data from given condition (e.g. frame index, pose skeleton, etc.) is difficult due to the visual data's high dimensions. Our idea is to regress the visual data in latent space which is encoded by VAE. Our method can generate high-quality visual data from frame index or pose skeletons.
	Attentional Correlation Filter Network for Adaptive Visual Tracking Jongwon Choi, Hyung Jin Chang, Sangdoo Yun, Tobias Fischer, Yiannis Demiris, Jin Young Choi. CVPR, 2017 Bibtex / Code Correlation-filter-based trackers usually use a pre-defined feature extractor (e.g., color, edge, etc.). Using more correlation filters with diverse feature extractors at the same time brings higher accuracy, but it induces a speed-accuracy trade-off. This work extends the number of correlation filters to more than one hundred to maximize accuracy. To deal with the heavy computation, we introduce an LSTM-based attentional filter selection approach. Our method achieves state-of-the-art performance among real-time trackers.
	PaletteNet: Image Recolorization with Given Color Palette Junho Cho, Sangdoo Yun, Kyoung Mu Lee, Jin Young Choi. CVPR Workshop, 2017 Bibtex We propose an image recolorization method based on a given palette.
	Butterfly Effect: Bidirectional Control of Classification Performance by Small Additive Perturbation Youngjoon Yoo, Seonguk Park, Junyoung Choi, Sangdoo Yun, Nojun Kwak. arXiv, 2017 Bibtex This paper proposes a new algorithm for controlling classification results by generating a small perturbation without changing the classifier network. We show that the perturbation can degrade the performance like adversarial attack, or can improve classification accuracy as well.
	Visual Path Prediction in Complex Scenes with Crowded Moving Objects Youngjoon Yoo, Kimin Yun, Sangdoo Yun, JongHee Hong, Hawook Jeong, Jin Young Choi. CVPR, 2016 Bibtex Learn latent Dirichlet allocation model from the trajectory of people and predict future paths of people.
	Density-aware Pedestrian Proposal Networks for Robust People Detection in Crowded Scenes Sangdoo Yun, Kimin Yun, Jongwon Choi, Jin Young Choi. ECCV Workshop, 2016 Bibtex Detecting people in crowded scenes by considering crowd density information. Our intuition is that more people should be detected in crowded regions.
	Voting-based 3D Object Cuboid Detection Robust to Partial Occlusion from RGB-D Images Sangdoo Yun, Hawook Jeong, Soo Wan Kim, Jin Young Choi. WACV, 2016 Bibtex Predicting holistic 3D structure from partially occluded RGB-D images. The key idea is a voting mechanism. Each part of an object indicates the center of the 3D structure.
	Visual Surveillance Briefing System: Event-based Video Retrieval and Summarization Sangdoo Yun, Kimin Yun, Soo Wan Kim, Youngjoon Yoo, Jiyeoup Jeong. AVSS, 2014 (Oral Presentation) Bibtex We propose a Visual Surveillance Briefing (VSB) system which generates summarized video with important events.
	Self-organizing Cascaded Structure of Deformable Part Models for Fast Object Detection Sangdoo Yun, Hawook Jeong, Woo-Sung Kang, Byeongho Heo, Jin Young Choi. ICPR, 2014 Bibtex We improve the computational efficiency of the deformable part model (DPM) by re-organizing the order of part filters. With a cascaded structure, we place the more important part filters first for early rejection.
	Multiple ground plane estimation for 3D scene understanding using a monocular camera Sangdoo Yun, Soo Wan Kim, Kwang Moo Yi, Haan-ju Yoo, Jin Young Choi. IVCNZ, 2012 (Oral Presentation) Bibtex Ground plane estimation is important for 3D scene understanding. Models usually assume the scene has a single ground plane, but sometimes it has multiple ground planes. We introduce multiple ground plane estimation for more robust scene understanding.

Academic service

Workshop on ImageNet: Past, Present, and Future.
Zeynep Akata, Lucas Beyer, Sanghyuk Chun, Almut Sophia Koepke, Diane Larlus, Seong Joon Oh, Rafael Sampaio de Rezende, Sangdoo Yun, Xiaohua Zhai.
NeurIPS, 2021
Website / Virtual Page / Preview in CV News

ImageNet has played an important role in CV and ML in the last decade. It was created to train image classifiers at first, but it has become a go-to benchmark for model architectures and training techniques. We believe now is a good time to discuss ImageNet and its future. The workshop asks questions such as: Did we solve ImageNet? What have we learned from ImageNet? What should the next-generation ImageNet-like dataset be?

Lectures

AI773: Topics in Artificial Intelligence - Multimodal Deep Learning Theories and Applications

Reviewing activities

Serve as an area chair at NeurIPS'23'24'25 D&B, ECCV'24, ICLR'25, CVPR'25.
Serve as a meta-reviewer at AAAI'22, AAAI'23, AAAI'24
Serve as a reviewer at CVPR, ICCV, ECCV, ICML, NeurIPS, ICLR, AAAI, etc.
Outstanding reviewer awards at CVPR'21, ICCV'21, CVPR'22.

Talks

Towards Strong and Robust Deep Models: Insights from Data and Supervision, RCV Workshop @ ICCV 2023
Towards Strong and Robust Deep Models: Insights from Data and Supervision, POSTECH, Sep, 2023
Towards Strong and Robust Deep Models: Insights from Data and Supervision, UNIST, Sep, 2023
How to Make Deep Models Strong and Robust, Sogang Univ., Sep, 2022
How to Make Deep Models Strong and Robust, Korea Univ., Apr, 2022
How to Make Deep Models Strong and Robust, UNIST, Jun, 2021

Template borrowed from Jon Barron and Seong Joon Oh.