Towards the goal of constructing a large-scale benchmark with high diversity, we propose a hierarchical structure to organize our dataset. The figure above illustrates our lexicon, which contains three levels from roots to leaves: domain, task, and step.

 • Domain: For the first level of COIN, we choose 12 domains: nursing & caring, vehicles, leisure & performance, gadgets, electrical appliances, household items, science & craft, plants & fruits, snacks & drinks, dishes, sports, and housework.

 • Task: At the second level, each task is linked to a domain. For example, the tasks "replace a bulb" and "install a ceiling fan" are associated with the domain "electrical appliances".

 • Step: The third level of the lexicon consists of the series of steps required to complete each task. For example, the steps "remove the lampshade", "take out the old bulb", "install the new bulb", and "install the lampshade" are associated with the task "replace a bulb".
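The three-level structure above can be sketched as a nested mapping from domain to task to steps, using only the examples given in the text (the empty step list for "install a ceiling fan" is a placeholder, not part of the dataset description):

```python
# Minimal sketch of the COIN lexicon: domain -> task -> ordered steps.
lexicon = {
    "electrical appliances": {
        "replace a bulb": [
            "remove the lampshade",
            "take out the old bulb",
            "install the new bulb",
            "install the lampshade",
        ],
        "install a ceiling fan": [],  # steps omitted in this sketch
    },
}

# e.g. look up the ordered steps of a task
steps = lexicon["electrical appliances"]["replace a bulb"]
print(len(steps))  # 4
```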


The COIN dataset consists of 11,827 videos covering 180 different tasks, all collected from YouTube. The average video length is 2.36 minutes. Each video is labelled with 3.91 step segments on average, and each segment lasts 14.91 seconds on average. In total, the dataset contains 476 hours of video with 46,354 annotated segments.



The dataset and annotation
We store the URLs of the videos and their annotations in JSON format, which can be accessed via the link COIN.
You may use the script to download the raw videos from YouTube.
We also maintain a copy of the COIN dataset. Please email tys15@mails.tsinghua.edu.cn to obtain it.
The features we used for weakly-supervised action segmentation can be obtained from here.
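Once the JSON file is downloaded, the per-video step annotations can be flattened for further processing. The sketch below is hypothetical: the keys "database", "annotation", "segment", and "label", and the example video ID, are assumptions for illustration, not the documented schema — consult the released COIN JSON for the actual field names:

```python
import json

def extract_segments(data):
    """Flatten per-video annotations into (video_id, step_label, start, end) tuples.

    Assumes a hypothetical schema; adjust the keys to match the released JSON.
    """
    segments = []
    for video_id, video in data.get("database", {}).items():
        for ann in video.get("annotation", []):
            start, end = ann["segment"]  # segment boundaries in seconds
            segments.append((video_id, ann["label"], start, end))
    return segments

# toy example with made-up video ID and timestamps
example = json.loads("""
{
  "database": {
    "abc123": {
      "annotation": [
        {"label": "remove the lampshade", "segment": [3.2, 11.8]},
        {"label": "take out the old bulb", "segment": [12.0, 20.5]}
      ]
    }
  }
}
""")
print(extract_segments(example))
```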

The COIN dataset is organized in a hierarchical structure with three levels: domain, task, and step. The corresponding relationships can be found in the taxonomy.

Benchmark and Evaluation
To provide a benchmark for instructional video analysis, we evaluate a variety of approaches on the COIN dataset under different evaluation criteria. See the source code for more details.

Frame accuracy (FA) = correctly predicted frames / total frames
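The metric above can be computed directly from per-frame label sequences; a minimal sketch (the toy labels below are made up for illustration, not taken from the dataset):

```python
def frame_accuracy(pred, gt):
    """Frame accuracy (FA) = correctly predicted frames / total frames."""
    assert len(pred) == len(gt), "prediction and ground truth must align per frame"
    correct = sum(p == g for p, g in zip(pred, gt))
    return correct / len(gt)

# toy example: per-frame step labels for 8 frames (0 = background)
gt   = [0, 0, 1, 1, 1, 2, 2, 0]
pred = [0, 1, 1, 1, 2, 2, 2, 0]
print(frame_accuracy(pred, gt))  # 0.75
```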

Annotation Tool

Given an instructional video, the goal of annotation is to label the step categories and their corresponding segments. As the segments vary in length and content, labelling COIN with conventional annotation tools would require an enormous amount of work. To improve annotation efficiency, we developed a new annotation toolbox. Please find it at the following link if you are interested.



Jiwen Lu

Associate Professor

Tsinghua University

Lili Zhao

This project was done while Lili Zhao was in Meitu Inc.

Jie Zhou

Tsinghua University

Yansong Tang

Ph.D. candidate

Tsinghua University

Dajun Ding

Meitu Inc.

Yu Zheng

Undergraduate student

Tsinghua University


Yansong Tang*,  Dajun Ding,  Yongming Rao*,  Yu Zheng*, Danyang Zhang*,  Lili Zhao,  Jiwen Lu*,  Jie Zhou*

Other Contributors

Yongxiang Lian*,  Yao Li,  Jiali Sun,  Chang Liu,  Dongge You,  Zirun Yang,  Jiaojiao Ge,  Jiayun Wang*

*Tsinghua University,   Meitu Inc.


A preprint is available on arXiv. Please cite the following paper if COIN is useful to your research:

@inproceedings{tang2019coin,
    title={COIN: A Large-scale Dataset for Comprehensive Instructional Video Analysis},
    author={Yansong Tang and Dajun Ding and Yongming Rao and Yu Zheng and Danyang Zhang and Lili Zhao and Jiwen Lu and Jie Zhou},
    booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2019}
}