TL;DR

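The creed at AI companies is that more data yields better performance, but data scale is not all you need. High-quality data yields better performance than a larger, lower-quality dataset. Producing high-quality data requires filtering out noise, understanding your unlabeled data, and understanding what to label. Labeling data at scale through labeling platforms is also problematic: their incentives are often misaligned with yours, and the platform itself becomes a bottleneck that is time-consuming, error-prone, and expensive. The best way to improve an AI system is to understand the data going into the model by intelligently representing the dataset in an interactable form, using methods such as self-supervised representation learning, foundation modeling, and filtering. These practices guard against both underperforming AI systems and the risk of generating harmful outputs.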

Less Is More

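Data scale is not all you need. Blindly increasing dataset size when pretraining a model puts AI-first companies at risk of making serious mistakes. Training a model on a large dataset whose distribution is unknown leads to unexpected behavior: in robotics, this can mean wrong and dangerous trajectories; for healthcare companies, inaccurate risk assessments; and for large language models, harmful speech{9}. On X, Grok made exactly this mistake, generating harmful speech in the now-deleted posts shown in Figure 0a. Even xAI's CEO acknowledged that they need to be "more selective about training data, rather than just training on the entire Internet". But how do you choose the right data to train and evaluate these models properly? What tools are available?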

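The solution is to intelligently represent your data in a form that is interactable and semantically diverse enough. This approach helps you: 1. create training and evaluation datasets for pretraining and post-training, 2. identify gaps in your data, and 3. get recommendations on how to fill those gaps (by buying or collecting data).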

Figure 0a: Examples of an LLM generating harmful speech, likely due to the existence of similar text in the training data the xAI team used to train Grok.

Figure 0b: Reaction from the xAI CEO after Grok generated harmful speech. The interesting piece is the team's focus on being selective about the training data. Original post from the xAI CEO: https://x.com/elonmusk/status/1944132781745090819

Data Flywheels{10} and Labeling Companies

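Across the industry, most AI company CEOs, AI researchers, and engineers are unhappy with the modern labeling companies that have integrated themselves into the data flywheel.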

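The current go-to solution for AI companies is to accumulate a large unlabeled dataset for pretraining (or use an open-source pretrained model), then label another large dataset targeted at the intended task, and finally hand-curate training and evaluation sets. The labeling work is usually outsourced to labeling companies (ScaleAI, SuperAnnotate, Labelbox, etc.) that integrate themselves into the data engine. But labeling everything in a large dataset does not work well: scaling data labeling to millions or billions of samples is error-prone, unsustainably expensive, and time-consuming, leaving AI companies dissatisfied. More importantly, the labeling loop never ends, because the data flywheel keeps adapting to evolving models and newly collected data, which makes labeling requirements fluid over time. Labeling companies cannot keep up with the pace of change: model updates can happen within weeks, while labeling can take months.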

The modern labeling loop in a data engine is:

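1. Collect some data.
2. Design or update a labeling spec.
3. Send the data and the spec to a labeling company (Scale, SuperAnnotate, etc.) and pay for the labels.
4. Iterate with the labeling company and train the model.
5. Look at the results, then repeat steps 2-5 indefinitely.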

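For example, a self-driving company might want to label stop signs. After labeling 1 million stop signs and seeing the results, they realize they also want to label each stop sign's "visibility". Then they realize they also want to label the trees that may surround a stop sign, adding an "occlusion" tag. Now all the data (which has kept growing in the meantime, because data collection is continuous) needs to be relabeled! As long as the company keeps improving its model, this loop never ends!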
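To make the cost of this loop concrete, here is a minimal sketch. The 1M starting labels come from the stop-sign example; the amount of newly collected data and the number of spec revisions are hypothetical values chosen only for illustration, and the function name is ours.

```python
def cumulative_labels(initial_items, growth_per_round, spec_revisions):
    """Total labels paid for when every spec revision relabels the whole dataset.

    All three arguments are illustrative assumptions, not figures from a
    real project.
    """
    dataset_size = initial_items
    total_labeled = dataset_size  # first full labeling pass
    for _ in range(spec_revisions):
        dataset_size += growth_per_round  # collection continues in the background
        total_labeled += dataset_size     # the revised spec forces a full relabel
    return total_labeled


# 1M stop signs, then two spec revisions ("visibility", "occlusion"),
# with a hypothetical 250k new items collected before each revision.
total = cumulative_labels(1_000_000, 250_000, 2)
print(f"labels paid for: {total:,}")  # labels paid for: 3,750,000
```

Under these assumptions the company pays for 3.75M labels to end up with a 1.5M-item dataset, and the gap widens with every further revision.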

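Meta spent $14.3 billion for a 49% stake to hire Scale.AI's CEO [11], which may be one of the riskiest moves in the company's history, given these difficulties with labeling companies.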

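So if blindly training on massive datasets is problematic and labeling everything is hard, what should we do instead? After working on this problem for the past four years, we have found that the best solution is to represent the data well enough that it becomes easy to select data, understand what is in it, and understand how it affects our models. We should be able to talk to our data in a way that lets us quickly search for examples and quickly build evaluation sets to test our models.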

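That is what we are building at Interpret AI: a data introspection platform, a data curation platform, and an intelligent data marketplace that let companies building AI systems interact with and understand their datasets. We envision a world where you can talk to your data using natural language, audio, images, and video to search for similar instances, so that companies can trust and understand the data (or the gaps in the data) driving their models. (If this resonates with you, feel free to reach out at ily@interpretai.tech)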

Scale What Is Likely to Help First

The Traditional Data Flywheel

Figure 1a: The traditional data engine powering AI solutions in companies.

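The company has some infrastructure that continuously collects data into a dataset (1b). The team then creates heuristic data subsets that they hope will improve the model once labeled (1a).
The data is sent to a labeling company. The labeling company produces labels (annotations), which the team then reviews; converging can take months of back-and-forth.
The AI model is then pretrained.
The pretrained model is then fine-tuned using the labels from the labeling company.
The final model is evaluated with the company's evaluation system, producing metrics.
The company then uses this feedback to possibly select other data subsets, update the labeling requirements, and/or make model changes. Note that by this point the data subsets have already started to go stale.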

Note: metrics can be skewed by poor labels, forcing the team to iterate constantly, which is both expensive and time-inefficient (6).

Figure 1b: A breakdown of the time requirements for different processes in a traditional company’s approach to solutions. Notice that the major bottleneck is getting labels from a labeling company.

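Figure 1b: Time constraints and setup of a traditional company's AI system, with rough timelines for iterating on each part independently. Note that with a labeling company in the loop, producing labels that actually improve the AI model takes months of iteration. See Figure 1a for how these parts interact at a traditional company.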

Interpret AI's Data Flywheel:

Start with Deep Data Insights

Figure 2a: Interpret’s AI data flywheel & how we provide immediate data insights.

Figure 2a: Interpret AI's data flywheel.

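Data subset recommendations and augmented data suggestions for pretraining and training are provided immediately (1a and 1b, respectively).
The team now reviews the significantly smaller data subsets suggested by Interpret before sending them to a labeling company. These subsets are fluid and update continuously as the data changes (optionally, if the company integrates its baseline model, Interpret AI can provide further insight into how the data affects model performance).
The back-and-forth with the labeling company speeds up from months to weeks, and costs drop significantly because the labeling spec and dataset selection are clear.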

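Feedback is focused on the model (6).
Finally, Interpret AI analyzes your data space and provides insights on which data to collect or buy to accelerate model improvement.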
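One way to picture a "much smaller but semantically diverse" subset is greedy farthest-point (k-center) selection over embeddings. The sketch below is an illustrative stand-in, not a description of how Interpret AI actually selects data; the function name and the 2-D embeddings are made up for the example.

```python
import math


def farthest_point_subset(embeddings, k):
    """Greedily pick up to k indices so each new pick is far from earlier picks."""
    if k <= 0 or not embeddings:
        return []
    chosen = [0]  # start from the first item (an arbitrary choice)
    # Distance from every item to its nearest already-chosen item.
    nearest = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(chosen) < min(k, len(embeddings)):
        remaining = [i for i in range(len(embeddings)) if i not in chosen]
        candidate = max(remaining, key=lambda i: nearest[i])
        chosen.append(candidate)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], math.dist(e, embeddings[candidate]))
    return chosen


# Hypothetical 2-D embeddings: a tight cluster near the origin plus two outliers.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (-4.0, 3.0)]
print(farthest_point_subset(points, 3))  # [0, 3, 4]
```

The three picks cover the cluster and both outliers, whereas a random sample of three would most likely return near-duplicates from the cluster.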

Figure 2b: A breakdown of the time requirements for different processes in using Interpret’s platform. On the left hand side feedback iteration speed in green is accelerated. Notice there is no more bottleneck.

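Figure 2b: How Interpret AI integrates directly with our customers to accelerate model training, data categorization and understanding, and evaluation. Interpret AI provides solutions for: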

Understanding the existing data distribution.
Identifying model gaps related to data gaps.
Buying and curating data to fill data gaps.

Use Cases

We work with multiple businesses in the robotics, healthcare, and agentic LLM industries. If this resonates with you, feel free to reach out at ily@interpretai.tech

Healthcare

HealthCo is trying to predict the risk of cardiovascular disease in its patients.

For training

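Interpret AI analyzes the cardiovascular data with our Interpret foundation model, processing electronic health records, images, and possibly ECG data if available [12].
Interpret AI notices anomalies or "gaps" in HealthCo's data and describes the demographics of these people (e.g., female, middle-aged, no children, historically prescribed trimetazidine).
These detected records are further analyzed by experts. The selected data can then be updated, ignored, used to help buy more data on people historically prescribed trimetazidine, or sent to a labeling company to label this specific group.
The AI cardiovascular disease model is then trained on the selected data. If HealthCo integrates its cardiovascular model into the Interpret platform, we go further and analyze in real time where the model underperforms, allowing immediate introspection.
This process shortens model training time from months to weeks, quickly improving the AI system and saving money!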

For safety

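Suppose HealthCo has an example of a patient who had a heart attack, and they want to analyze the electronic health records of similar people who may also be at risk.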

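With Interpret AI, HealthCo can select this person's example and search for related people, ranked by confidence.
These people can be flagged as at risk, quickly identifying a few hundred at-risk people out of millions of records!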
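A search like this can be sketched as nearest-neighbor ranking over record embeddings. The example below is a minimal illustration that assumes each record has already been embedded as a vector and uses cosine similarity as the "confidence" score; both are our assumptions, and the record IDs, vectors, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_similar(query, records, top_k=3):
    """Return (record_id, score) pairs sorted from most to least similar."""
    scored = [(rid, cosine_similarity(query, vec)) for rid, vec in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical record embeddings; a real system would get these from a model.
records = {
    "patient_a": [0.9, 0.1, 0.0],
    "patient_b": [0.0, 1.0, 0.0],
    "patient_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # embedding of the patient who had the heart attack
for rid, score in rank_similar(query, records, top_k=2):
    print(rid, round(score, 3))
```

At the scale of millions of records, the brute-force loop would typically be replaced by an approximate nearest-neighbor index, but the ranking idea stays the same.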

Robotics

DriveCo is building autonomous race cars as toys for kids to play with outdoors.

For training

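Interpret AI analyzes the collected video data from race car runs and provides a data report.
Interpret AI notices that most of the rollouts in the videos are not geographically diverse enough, and that there are very few examples of driving the race car outdoors in a backyard.
Interpret AI suggests that the DriveCo team collect more outdoor video examples. We also try to balance the dataset in a learned way using our Interpret AI foundation model to mitigate this imbalance.
- Without Interpret AI, DriveCo might have sent over 1,000 hours of race car data to be labeled for objects they did not need! Now they only need to label 10 hours!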
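The simplest version of spotting a gap like "very few backyard examples" is counting how much of the dataset each scene type covers. The sketch below assumes each clip already carries a scene tag; the tags, the 10% threshold, and the function name are hypothetical, and a real system would more likely work from learned embeddings than from hand-assigned tags.

```python
from collections import Counter


def underrepresented(tags, min_share=0.10):
    """Return {tag: share} for tags whose share of the dataset is below min_share."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items() if n / total < min_share}


# Hypothetical scene tags, one per video clip.
clip_tags = ["indoor_track"] * 90 + ["driveway"] * 7 + ["backyard"] * 3
print(underrepresented(clip_tags))  # {'driveway': 0.07, 'backyard': 0.03}
```

A report like this tells the team where to collect or buy data before any labeling budget is spent.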

For safety

Suppose these autonomous race cars come under scrutiny for baby safety.

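DriveCo can search its database for videos containing "baby" to see whether they have this data.
If DriveCo does not have this data, this tells the team to go collect it (with fake babies, I hope); otherwise, it lets DriveCo show consumers and investors that the product really is safe around babies!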

How We Got Here

A Brief History of Labels and Pretraining

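In 2015, before Transformers, most models were trained to solve a very specific subset of problems: classification, segmentation, object detection (i.e., foundational problems), and so on [1]. Benchmarks were "largish" labeled datasets, on the order of 10k to 1M examples.{1}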

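Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining arrived as a fundamental paradigm shift, and suddenly unlabeled datasets unlocked huge gains in model performance. The unlabeled datasets used for pretraining were massive compared to their labeled counterparts [5]. This, combined with other techniques and advances{2}, gave rise to modern foundation models such as CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].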

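Then OpenAI changed the game again when it released GPT (Generative Pre-trained Transformer) [6], building on advances in Transformers, pretraining, and reinforcement learning. Sora [7], DeepSeek [8], and Anthropic [9] all use pretraining on large datasets as the backbone of their high-performing models. But hidden in there is a keen observation that most people are not talking about.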

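While pretraining is a good first step, most of these models need further training on top of pretraining. Whether through reinforcement learning or supervised fine-tuning, the highest-performing models are aligned with the original problem in some way{3}. But even fine-tuning only scales so far, which means improving pretraining is critical to future model performance{4}.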

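One of the most compelling examples in the literature of how to properly integrate pretraining and build a data flywheel is the labeling data flywheel Meta built for the Segment Anything Model (SAM) and SAM v2 [10]. But even in this example, data labeling was very hard to scale.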

Segment Anything: Innovation and Information

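TL;DR: What SAM shows us is that quality assurance and understanding what is in our data are hard, but they are important problems to solve. Adding more data is not necessarily the answer.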

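SAM built a data flywheel that used partially trained versions of SAM at different training stages, combined with human labeling feedback, to curate a large labeled dataset. Their approach illustrates the right way to integrate labeling into a pipeline, but it also highlights that even a properly built data labeling flywheel is expensive and hard to scale. At some point the dataset grows large enough that humans cannot label everything, so other introspection methods are needed (i.e., what Interpret is building).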

Roughly, SAM's approach is [10]:

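1. Start with an MAE-pretrained ViT (SAM 2 uses an MAE-pretrained hierarchical ViT).
2. Train SAM on publicly available segmentation datasets.
3. Use the partially trained SAM to generate segmentation masks on a subset of the data.
4. Have humans refine the predicted masks. Then also use the masks to train an object detector to find more objects, and have humans label those.
5. Repeat steps 3-4, gradually increasing the size of the dataset.
6. Finally, run the model over the full 11 million images to obtain the 1.1 billion masks of SA-1B. Use a QA team to flag potentially bad examples. Note that providing human labels for all 1.1 billion masks would be extremely difficult.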

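SAM 2 follows the same idea. It is a video segmentation model that produced the SA-V dataset, with 35.5 million masks across 50.9K videos, 53 times more masks than any existing video segmentation dataset [10].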

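Note that the best segmentation models are trained on data directly relevant to their task, with labeling feedback tightly coupled into a fast, efficient data flywheel. Pretraining and then training on a collection of open-source segmentation datasets are only steps one and two.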

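Also note that human labeling eventually hit a ceiling: by the time the data flywheel reached the scale of 1.1 billion masks, Meta still needed to run QA filters to flag bad examples. By a rough estimate based on the paper, manually labeling all 1.1 billion masks would take about 51,000 days of annotation time!{5}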
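The 51,000-day figure is simple arithmetic, reproduced in the sketch below using the 4 seconds per mask assumed in footnote {5}. The function name is ours, and the result is continuous annotation time for a single annotator, not calendar time for a team.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def annotation_days(num_masks, seconds_per_mask):
    """Continuous annotation time, in days, for one pass over every mask."""
    return num_masks * seconds_per_mask / SECONDS_PER_DAY


# 1.1B masks at the 4 seconds per mask assumed in footnote {5}.
print(round(annotation_days(1_100_000_000, 4)))  # 50926
```

At the 30 seconds per mask that the footnote also mentions, the same calculation comes out 7.5 times larger.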

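We are talking about Meta here; for most companies, hiring that many people would be extremely expensive and infeasible!{6} Labeling at this scale is just hard!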

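To restate the TL;DR: what SAM shows us is that quality assurance and understanding what is in our data are hard, but they are important problems to solve. This is fundamentally the gap we see in industry today: more data for pretraining or fine-tuning is not necessarily the answer. The right approach is to identify where the model underperforms, understand why it underperforms there, and then surface the data (or data gaps) related to the problem, which is what we are doing at Interpret AI.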

A Labeling Company's Goals Are Not Necessarily Aligned with Yours...

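We have industry experience at MAANG, and our team has worked with labeling companies such as Scale and SuperAnnotate. For most labeling companies, the business model is: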

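Have the company write its own labeling spec, possibly with some back-and-forth depending on the complexity of the labels.
Most labeling companies have different tiers of annotators: the largest pool is non-experts who label everything, and the smallest is experts in the field (e.g., doctors). The labeling company then assembles a pool of human annotators, usually starting with the cheapest for a low-quality first pass.
The annotators then do their best to label according to the company's complex labeling spec, charging per label.
Feedback is provided and the labels are updated, possibly along with the labeling spec.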

This process has four main problems:

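1. Labels are inconsistent and often not assigned to the right annotators,
2. Labeling is time-consuming and expensive,
3. The feedback loop for correcting labels is error-prone, and
4. The labeling spec changes over time as model performance changes.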

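On 1., annotators are not guaranteed to be suited to the labeling tasks they are assigned, and they often label differently from their peers. For example, for a healthcare company, if the task is "select the clinical response that best diagnoses the patient", the annotators may not even be doctors qualified for the task! Likewise, for a self-driving company, if the task is "draw a bounding box around the stop sign", does that include the pole? What about the back of a stop sign? Different annotators who do not consult each other will label differently.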
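Disagreement like this can be measured before it reaches the model. One common metric (our choice for illustration; the text does not name one) is Cohen's kappa, which compares observed agreement between two annotators with the agreement expected by chance. The labels below are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class at random.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical class throughout
    return (observed - expected) / (1 - expected)


# Hypothetical question: does the stop-sign box include the pole?
annotator_1 = ["with_pole", "with_pole", "sign_only", "sign_only"]
annotator_2 = ["with_pole", "sign_only", "sign_only", "sign_only"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

A kappa well below 1 on a small pilot batch is a cheap signal that the spec is ambiguous and should be tightened before paying for labels at scale.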

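On 2., charging per label sounds great in theory, since the conventional wisdom is that more labels help, but only if the company can afford enough labels to improve model performance, and that number is usually unknown. These labels will usually also contain errors, requiring the AI company to build internal systems to review them, which costs both time (on the order of months) and more money.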

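On 3., the feedback loop is also inconsistent. Typically the responsibility for label verification is pushed onto the AI company, which has to build its own internal monitoring system (already time-consuming and expensive). When the AI company notices a labeling problem, there is no guarantee that the correction comes from the same annotator who created the faulty label, and sometimes the labeling company relabels the entire faulty example instead of correcting it, which costs more. For example, a self-driving company may want instance masks for traffic lights and people. In this toy example, the first annotator makes a mistake and forgets to label the traffic lights that are not facing the camera. The AI company flags it and sends it back for review, but the labeling company fixes the issue by sending the image to a new annotator, who relabels everything from scratch! The second annotator fixes the original problem but does not label the police officer as a "person", so now there is a new problem! See Figures 3a and 3b. The probability that one pass through this loop labels every object correctly is lower than it sounds: only about 61% for an example with 50 labels{7}.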

Figure 3a: First pass by the first annotator who missed the traffic lights that are not facing the camera. (Image from Waymo Open Dataset [17])

Figure 3b: Second pass from the second annotator who got all the traffic lights but didn’t realize that the “people” class included police officers! (Image from Waymo Open Dataset [17])

Essentially, with this feedback system the labels a labeling company creates are not guaranteed to converge to the correct labels!
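The 61% figure for 50 labels follows from the assumptions in footnote {7}: a 1% chance of error per label and independent errors. The sketch below reproduces that arithmetic; the function name is ours.

```python
def all_correct_probability(p_error, num_labels):
    """Probability that num_labels independent labels are all correct."""
    return (1 - p_error) ** num_labels


# p = 0.01 per label and 50 labels per example, as assumed in footnote {7}.
print(round(all_correct_probability(0.01, 50), 3))  # 0.605
```

Because a from-scratch relabel is a fresh draw with the same odds, resubmitting a faulty example does not move it any closer to a clean result, which is why the loop need not converge.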

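The AI company's incentives are not aligned with the labeling company's incentives. The AI company wants to improve its AI models and products, while the labeling company wants to label as much of the company's data as possible so that it can charge for it. You want your model to perform well, and your labeling company should want that too.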

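On 4., in industry (and in research) there are many possible solutions when trying to solve a problem. Maybe pretraining on the entire Internet will improve your LLM, or maybe grounding the LLM by training on labeled text-image pairs will help it reason, or maybe adding chain of thought will help. In other words, when designing AI systems we need to try many different things in parallel, because it is sometimes unclear what the best approach is. Labeling is one such solution, which means label definitions may change as we understand our problem better.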

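Take labeling stop signs for self-driving as an example. Suppose we start by labeling stop signs. We notice that performance improves when we know whether a stop sign is partially occluded, so we later update the labeling spec to add a metadata tag called "occlusion" for when the sign is partially visible or not visible. Then we go back to the labeling company and ask them to relabel all of our stop signs with it! Having a "labeling platform in the loop" like this means that every model experiment that updates the labeled dataset is very expensive!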

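So one might wonder, why use a labeling provider at all? For two reasons. First, as mentioned earlier, high-quality labels on your data really do help. In fact, less data with higher-quality labels can outperform some large pretrained models; SAM is a great example. Second, the alternative to using a labeling company is building an in-house labeling platform, which is even more expensive and time-consuming, because producing the same number of labels as other players could take years!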

Conclusion

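The optimal data flywheel represents data in a form that is inherently insightful and interactable: we should be able to detect anomalies, and also talk to our data to surface interesting patterns and insights. This flywheel should augment labeling platforms by focusing on what should be labeled rather than labeling everything{8}. Finally, this data flywheel should be aligned with model performance, tied directly to whatever problem your AI company is solving.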

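The conventional wisdom is that more data "just works", and sometimes deep learning feels like alchemy. Maybe more data works for you in the short term, but when things "just don't work", the right approach is to evaluate the failures in the data and the model and work from there.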

At Interpret, we want to change this paradigm. If you are interested, reach out to us at ily@interpretai.tech

Footnotes

{1} Back when AlexNet was still a thing, circa 2015, most computer vision models were trained on a subset of very particular problem types: classification, segmentation, object detection (i.e., foundational problems) and others like image captioning, scene recognition, and pose estimation (see the appendix for more details) [1]. Note this was pre "Attention Is All You Need", when bigrams were à la mode. The focus then was model development while benchmarks remained fixed. These benchmarks were "largish" labeled datasets (on the order of 10k to 1M) that were used to evaluate model performance. Some of the popular CV benchmarks you're probably familiar with are MNIST, ImageNet, MS COCO, KITTI, and Caltech-101 [2]. If you look at the largest labeled datasets around this time, they had around 1M labels, and that was considered large.
{2} Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift from learning features for only a specific labeled dataset to learning general features on unlabeled data that correlated well with other problems like classification, segmentation, and object detection. These datasets were massive compared to their labeled brethren [5]. At the same time, advancements in model training (CUDA optimization, which is why NVIDIA hit a 4T market cap), deep learning libraries (TensorFlow, PyTorch), and new or improved model architectures like Transformers from "Attention Is All You Need" opened up a brand new world. Researchers also noticed that increasing the size of models typically correlated with improved performance on unseen data (from the same data distribution). All of this combined with modern pretraining algorithms like pretext tasks, contrastive learning, masked language modeling, masked autoencoding (MAE), and multimodal modeling [4], unlocking the era of training big models on even more massive unlabeled datasets. Ergo, models like CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].
{3} "Alignment" is an overused term. I mean alignment in both the "we want our LLM to be helpful, not harmful" sense and the "data distribution alignment" sense.
{4} When training or fine-tuning a model, scaling model size correlates with improvement in performance, roughly following a power law. In industry, we're already hitting the peak of model-size scaling laws, and fine-tuning is giving less and less of an advantage. The next frontier is improving pretraining methods to better utilize existing unlabeled datasets.
{5} In the SAM paper, annotations could take 30 seconds (but suppose each took 4 seconds, based on the improvements from SAM v2 [10]); reviewing 1.1B masks would've required 1,100,000,000 * 4 seconds = 4.4 billion seconds, or ~51,000 days of annotation time!
{6} This also assumes that the data distribution is stationary (unchanging). If we wanted to extend the labels to a different data distribution (say, deep sea diving videos, where the semantics and dynamics of objects are different), then fine-tuning SAM would still require the same data flywheel training process, which means more time and more money.
{7} Suppose that each object has a probability p = 0.01 of being mislabeled (i.e., an annotator labels incorrectly or misses a label once every 100 labels). Assuming 50 objects in a video and independent errors, the probability that every label is correct is (1 - p)^50 ≈ 0.605, roughly a 61% chance of success! And that's conservative.
{8} Fundamentally, when AI companies have better clarity on what to label, their incentives align with those of annotation companies.
{9} More and more, it is clear that very few samples (e.g., thousands) of very high-quality data are far better than millions of low-quality samples. This is particularly true in post-training of LLMs in industry, but it is also starting to be the focus of pretraining.
{10} A data flywheel is the loop used to collect data and improve the model, which makes a better product, which then modifies what data to collect, and the cycle repeats (for example, this image from dataloop.ai https://dataloop.ai/book/the-data-flywheel-effect/). A data engine is the infra for collecting/labeling/evaluating data (for example, Scale's product https://scale.com/data-engine).

Special Thanks

Cameron Tukerman-Lee (also credit for the title)
Gabriele Sorrento
Francesco Pongetti
Lotfi Herzi

Appendix

[1] A more extensive list of popular 2015 foundational problems across different domains (so, sort of pre-multimodal).
- Computer vision
  - classification
  - segmentation
  - object detection
  - image captioning
  - scene recognition
  - pose estimation
  - Optical Flow Estimation
  - Depth Estimation
  - Face recognition
  - Visual tracking
  - Style transfer
  - Image generation
- Natural Language Processing
  - Machine translation
  - Part of speech tagging
  - Question answering
- Speech Processing
  - Speech recognition
  - Speaker identification
  - Emotion classification
- Time series
- Reinforcement Learning
[2] Popular datasets separated by domain around 2015
- Classification
  - ImageNet (ILSVRC 2017) - 1.2M training, 1000 classes - https://www.image-net.org/challenges/LSVRC/2017/index.php
  - CIFAR-10/100 - 60K (32x32), 10/100 classes - https://www.cs.toronto.edu/~kriz/cifar.html
  - MNIST - 70K handwritten digits - https://www.kaggle.com/datasets/hojjatk/mnist-dataset
  - Fashion-MNIST - 70K fashion items - https://github.com/zalandoresearch/fashion-mnist
  - SVHN - 600K real world house numbers 10 classes for each digit - http://ufldl.stanford.edu/housenumbers/
  - Caltech-101/256 - 9K/30K images 101/256 categories - https://data.caltech.edu/records/mzrjq-6wc02, https://data.caltech.edu/records/nyy15-4j048
  - Oxford Flowers 102 - 102 categories - https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
  - Oxford-IIIT Pets - 7.4K images, 37 pet breeds - https://www.robots.ox.ac.uk/~vgg/data/pets/
  - Stanford Cars - 16K images, 196 car models - https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset
  - FGVC Aircraft - 10.2K images, 100 aircraft variants - https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
  - Food-101 - 101 food categories - https://www.kaggle.com/datasets/dansbecker/food-101
  - CUB-200-2011 - 12K bird images, 200 species - https://www.vision.caltech.edu/datasets/cub_200_2011/
  - Stanford Dogs - 20K images, 120 dog breeds - http://vision.stanford.edu/aditya86/ImageNetDogs/
  - MIT Indoor Scenes - 15K images, 67 indoor categories - http://web.mit.edu/torralba/www/indoor.html
- Segmentation
  - PASCAL VOC 2012 - 11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
  - MS COCO - 328K images, 80 object classes, 91 stuff categories, 5 captions per image, 250k people with keypoints - https://cocodataset.org/
  - Cityscapes - 5K fine/25K coarse annotations, 8 classes - https://www.cityscapes-dataset.com/, https://www.cityscapes-dataset.com/dataset-overview/#class-definitions
  - ADE20K - 25K images, 150 classes - https://groups.csail.mit.edu/vision/datasets/ADE20K/
  - PASCAL Context - 10K images, 459 classes - https://cs.stanford.edu/~roozbeh/pascal-context/
  - SBD (Semantic Boundaries) - 11K images from PASCAL - https://paperswithcode.com/dataset/sbd
  - NYUDv2 - 1.4K RGB-D images - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
  - SUN RGB-D - 10K RGB-D images - https://rgbd.cs.princeton.edu/
  - KITTI Semantic - http://www.cvlibs.net/datasets/kitti/
- Object Detection
  - PASCAL VOC 2012 - 10K/11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/
  - MS COCO - 328K images, 80 classes, 1.5M instances - https://cocodataset.org/
  - KITTI Object - http://www.cvlibs.net/datasets/kitti/
  - Open Images (v1 in 2016) - 15.8 images, 6000 classes - https://storage.googleapis.com/openimages/web/index.html
  - WIDER Face - 32K images, 393K face annotations - http://shuoyang1213.me/WIDERFACE/
- Other Tasks
  - Depth Estimation
    - NYUDv2 - 1.4K RGB-D scenes - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
    - KITTI Depth - http://www.cvlibs.net/datasets/kitti/
    - Make3D - 534 images with depths - http://make3d.cs.cornell.edu/data.html
  - Optical Flow
    - Sintel - http://sintel.is.tue.mpg.de/
    - KITTI Flow - http://www.cvlibs.net/datasets/kitti/
    - Flying Chairs - 22K synthetic pairs - https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html
    - Middlebury - Small but precise benchmark - https://vision.middlebury.edu/flow/
  - Pose Estimation
    - MPII Human Pose - 25K images, 40K people - http://human-pose.mpi-inf.mpg.de/
    - FLIC - 5003 images from movies - https://bensapp.github.io/flic-dataset.html
    - Leeds Sports Pose - https://www.kaggle.com/datasets/dkrivosic/leeds-sports-pose-lsp
  - Face Recognition
    - LFW (Labeled Faces in the Wild) - 13K images, 5.7K people - https://www.kaggle.com/datasets/jessicali9530/lfw-dataset
    - CelebA - 200K images, 10K identities - http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
    - MegaFace - 1M images, 690K identities - http://megaface.cs.washington.edu/
    - VGGFace - 2.6K people - https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
  - Video/Action Recognition
    - UCF-101 - 13,320 videos, 101 actions - https://www.crcv.ucf.edu/data/UCF101.php
    - HMDB-51 - 6800 videos, 51 actions - https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
    - Sports-1M - 1M YouTube videos, 487 sports - https://cs.stanford.edu/people/karpathy/deepvideo/
    - ActivityNet - 20K videos, 200 classes - http://activity-net.org/
  - Attributes/Multi-label
    - WIDER Attribute - http://mmlab.ie.cuhk.edu.hk/projects/WIDERAttribute.html
    - Berkeley Attributes - https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/poselets/
- Reinforcement Learning (can think of dataset size as number of rollouts)
  - Classic control tasks
    - OpenAI Gym (cartpole, mountaincar, acrobot, etc). I remember this before chatgpt lol maybe I’m old
    - MuJoCo (Multi-joint dynamics with contact) like the halfcheetah, hopper, humanoid, etc. This was typically done in a physics simulation and was popular for PPO.
  - Board games
    - Go
    - Chess
    - PyGame
  - TORCS
  - Minecraft
  - ViZDoom
  - Atari 2600 from DeepMind
[3] Scaling Laws Paper, Larger pretrained models paper
- "Scaling Laws for Neural Language Models" by Jared Kaplan et al. (2020): https://arxiv.org/abs/2001.08361
- "Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level”: https://arxiv.org/abs/2105.06020
[4] Modern pretraining algorithms
- Pretext Tasks
  - Rotation prediction
  - Jigsaw puzzles
  - Colorization
  - Inpainting/Masked patches
- Contrastive Learning Methods
  - SimCLR (Chen et al., 2020): "A Simple Framework for Contrastive Learning of Visual Representations" (arXiv:2002.05709)
  - MoCo v1 & v2 (He et al., 2019/2020): "Momentum Contrast for Unsupervised Visual Representation Learning"; "Improved Baselines with Momentum Contrastive Learning" (arXiv:2003.04297)
  - BYOL (Grill et al., 2020): "Bootstrap Your Own Latent"
  - PIRL (Misra & van der Maaten, 2020): "Self-Supervised Learning of Pretext-Invariant Representations"
- Masked Modeling
  - Masked Language Modeling (MLM): BERT (Devlin et al., 2018)
  - Masked Autoencoder (MAE)
- Multimodal Learning
  - CLIP (Radford et al., 2021): "Learning Transferable Visual Models From Natural Language Supervision" (arXiv:2103.00020)
  - ALIGN (Jia et al., 2021)
  - DALL-E (Ramesh et al., 2021): "Zero-Shot Text-to-Image Generation"
[5] Pretraining datasets
- JFT-300M: Google’s internal 300M images, pseudo-labeled: https://ar5iv.labs.arxiv.org/html/1707.02968 (TO VERIFY)
- LAION-5B: 5.85 billion (image, text) pairs scraped from Common Crawl
- CLIP Training Data: 400M (image, text) pairs https://arxiv.org/abs/2103.00020 (not released)
- Wikipedia: English 20GB
- Kinetics-700: 650k videos (technically has action classes but still used)
[6] Improving Language Understanding by Generative Pre-Training https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[7] Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
[8] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism https://arxiv.org/abs/2401.02954
[9] Constitutional AI: Harmlessness from AI Feedback https://arxiv.org/abs/2212.08073
[10] Segment anything: https://arxiv.org/abs/2304.02643, SAM 2: Segment Anything In Images & Videos https://arxiv.org/pdf/2408.00714. More details below.
[11] https://techcrunch.com/2025/06/13/new-details-emerge-on-metas-14-3b-deal-for-scale/
[12] https://www.nature.com/articles/s41586-025-09227-0
[13] "Learning Transferable Visual Models From Natural Language Supervision” https://arxiv.org/abs/2103.00020
[14] "Zero-Shot Text-to-Image Generation” https://arxiv.org/abs/2102.12092
[15] "Emerging Properties in Self-Supervised Vision Transformers” https://arxiv.org/abs/2104.14294, "DINOv2: Learning Robust Visual Features without Supervision” https://arxiv.org/abs/2304.07193
[16] “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” https://arxiv.org/abs/1810.04805
[17] Waymo E2E Open dataset https://waymo.com/open/data/e2e#camera-data

Data Scale Is Not All You Need