Summary

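The dogma among AI companies is that more data leads to better performance, but in practice data scale isn't everything. High-quality data yields better performance than a larger, lower-quality dataset. Producing high-quality data requires filtering out noise, understanding your unlabeled data, and understanding what should be labeled. Labeling data at scale through annotation platforms is also problematic: their incentives are often misaligned with yours, and their platforms are a slow, error-prone, and costly bottleneck. The best way to improve an AI system is to understand the data being fed to the model by intelligently representing the dataset in an interactive way, using self-supervised representation learning, foundation modeling, and filtering. These practices reduce the risk that an AI system underperforms or produces harmful outputs.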

Less is more

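Data scale isn't everything. Blindly increasing dataset size when pretraining a model exposes AI companies to the risk of serious errors. Training a model on a large dataset with an unknown distribution leads to unexpected behavior: in robotics it can lead to wrong and dangerous trajectories, in healthcare to inaccurate risk assessments, and in LLMs to the generation of harmful speech {9}. On X, Grok made this mistake and generated harmful speech in the now-deleted posts shown in Figure 0a. Even the CEO of xAI admitted that the team needs to be more selective about its training data instead of training on the entire Internet. But how do we properly select data to train and evaluate these models? What tools are available?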

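The solution is to intelligently represent data in a form that is interactive and semantically diverse enough. This approach helps to (1) create training and evaluation datasets for both pretraining and post-training, (2) identify holes in the data, and (3) recommend how to fill those gaps, either by purchasing or by collecting data.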

Figure 0a: Examples of an LLM generating harmful speech, likely due to the existence of similar text in the training data the xAI team used to train Grok.

Figure 0b: Reaction from the xAI CEO after Grok generated harmful speech. The interesting piece is the team's focus on being selective about the training data. Original post from the xAI CEO: https://x.com/elonmusk/status/1944132781745090819

The data flywheel {10} and annotation companies

Across the industry, most AI company CEOs, AI researchers, and engineers are dissatisfied with the modern annotation companies that integrate into their data flywheels.

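The current go-to solution for AI companies is to accumulate a large unlabeled dataset for pretraining (or to use an open-source pretrained model), then label another large dataset specific to the intended task, and finally curate training and evaluation sets by hand. Labeling is usually outsourced to annotation companies (ScaleAI, SuperAnnotate, Labelbox, etc.) that integrate into the data engine. However, labeling everything in a large dataset does not work well: scaling data labeling to millions or billions of examples is error-prone, unsustainably expensive, and slow, and it leaves AI companies frustrated. More importantly, the labeling loop is a never-ending process. Because the data flywheel continuously adapts to an evolving model and to newly collected data, labeling requirements are fluid and change over time. Annotation companies cannot keep up with the pace of change: a model update can happen in weeks, while labeling can take months.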

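The modern labeling loop in a data engine looks like this:

1. Collect some data.
2. Design or update a labeling spec.
3. Send the data and the spec to a labeling company (Scale, SuperAnnotate, etc.).
4. Pay for the labeling.
5. Iterate with the labeling company and train the model.
6. Observe the results and repeat steps 2-5 indefinitely.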

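For example, a self-driving company might want to label stop signs. After labeling one million stop signs and looking at the results, it realizes that it also wants to label each stop sign's "visibility", then that it wants to label the trees that may be surrounding a stop sign, and so it adds an "obstructed" label. Now all of the data (which has grown in the meantime, because data collection is continuous) has to be relabeled! As long as the company keeps improving its model, the cycle never ends!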
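To make the cost of this loop concrete, here is a minimal sketch of how many labels get paid for when every spec revision forces a full relabel of a dataset that keeps growing. The function name and all of the numbers (growth per revision, number of revisions) are hypothetical and chosen only for illustration; they are not measurements from any real project.

```python
def cumulative_labels_paid(initial_items, growth_per_revision, revisions):
    """Count labels paid for when each spec revision relabels the whole dataset."""
    total_paid = 0
    dataset_size = initial_items
    # One initial labeling pass, plus one full relabel per spec revision.
    for _ in range(revisions + 1):
        total_paid += dataset_size
        # Data collection is continuous, so the dataset grows between passes.
        dataset_size += growth_per_revision
    return total_paid


# Hypothetical scenario: 1M stop signs, 250k new items collected between
# revisions, and two spec revisions ("visibility", then "obstructed").
paid = cumulative_labels_paid(1_000_000, 250_000, revisions=2)
print(paid)  # 3750000
```

Under these assumed numbers, two spec changes turn a one-million-label job into 3.75 million paid labels, which is why knowing what to label before sending data out matters so much.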

Meta spending $14.3 billion for a 49% stake in Scale.AI in order to hire its CEO [11] may be one of the riskiest moves the company has ever made, given these difficulties with labeling companies.

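So if training blindly on huge datasets is a problem and labeling everything is hard, what else should we do? Having worked on this problem for the past four years, we have found that the best solution is to represent the data well enough that it becomes easy to select it and to understand what is in it and how it affects the model. You should be able to chat with your data in a way that lets you quickly search for examples and quickly build evaluation sets to test your model.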

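That is what we are building at Interpret AI. We are building a data introspection platform, a data curation platform, and an intelligent data marketplace that let companies building AI systems interact with and understand their datasets. We imagine a world where you can chat with your data using natural language, audio, images, and video, and search for similar instances, so that companies can trust and know the data (or the data gaps) driving their models. (If any of this resonates with you, feel free to reach out at ily@interpretai.tech.)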

Scale first what might help

The traditional data flywheel

Figure 1a: The traditional data engine powering AI solutions in companies.

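1. The company has infrastructure that continuously collects data into a dataset (1b). The team then creates heuristic data subsets, hoping that the model will improve once they are labeled (1a).
2. The data is sent to a labeling (annotation) company. The labeling company produces labels (annotations), which are reviewed by the team. This can take months of back-and-forth to converge.
3. An AI model is pretrained.
4. The pretrained model is fine-tuned using the labels from the labeling company.
5. The final model is evaluated with the company's evaluation system, producing metrics.
6. The company then uses this feedback to select other data subsets, update labeling requirements, or make model changes. Note that by this point the data subsets are already stale.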

Note: the metrics can be skewed by poor annotations, which require continuous iteration from the team that is costly in both money and time (6).

Figure 1b: A breakdown of the time requirements for different processes in a traditional company’s approach to solutions. Notice that the major bottleneck is getting labels from a labeling company.

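Figure 1b: Time constraints and setup of a traditional company's AI system, with approximate timelines for iterating on each of these parts independently. Note that with a labeling company in the loop, it takes months of iteration to produce labels that properly improve the AI model. See Figure 1a for how each of these parts interacts in a traditional company.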

Interpret AI's data flywheel:

Start by knowing, with deep data insights

Figure 2a: Interpret’s AI data flywheel & how we provide immediate data insights.

Figure 2a: Interpret AI's data flywheel.

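Immediate data subset recommendations and enhanced data suggestions for pretraining and training (1a and 1b, respectively).
The team now reviews the much smaller data subsets suggested by Interpret before sending them to a labeling company. These subsets are fluid and are continuously updated as the data changes. (Optionally, if the company integrates a baseline model, Interpret AI can provide more insight into how the data affects model performance.)
The back-and-forth with the labeling company shrinks from months to weeks and becomes much cheaper, because the annotation spec and the dataset selection are clear.

Feedback is focused on the model (6).
Finally, Interpret AI analyzes the data space to provide insights on which data to collect or purchase to accelerate model improvement.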

Figure 2b: A breakdown of the time requirements for different processes in using Interpret’s platform. On the left hand side feedback iteration speed in green is accelerated. Notice there is no more bottleneck.

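Figure 2b: This figure shows how Interpret AI integrates directly with customers to accelerate model training, data triage and understanding, and evaluation. Interpret AI provides the following solutions:

Understanding the existing data distribution.
Identifying model gaps that correlate with data gaps.
Purchasing and curating data to fill data gaps.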

Use cases

We are working with several companies in the robotics, healthcare, and agentic LLM industries. If any of this resonates with you, feel free to reach out at ily@interpretai.tech.

Healthcare

HealthCo is trying to predict patients' risk of cardiovascular disease.

For training

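Interpret AI analyzes the cardiovascular data using Interpret's foundation models, processing EHRs, images, and potentially ECG data [12] where available.
Interpret AI notices anomalies or "holes" in HealthCo's data and describes the demographics of the people affected (for example: women, middle-aged, no children, historically prescribed trimetazidine).
These flagged records are analyzed further by experts. The selected data can be updated, ignored, used to guide the purchase of more data on people historically prescribed trimetazidine, or sent to a labeling company to annotate this specific group.
The selected data is used to train the AI cardiovascular disease model. If HealthCo integrates its cardiovascular model into the Interpret platform, it can further analyze in real time where the model underperforms, enabling immediate introspection.
This process shortens the model training timeline from months to weeks, rapidly improves the AI system, and saves money!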
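As a rough illustration of what finding a "hole" in a dataset can mean, the sketch below counts how often each demographic cohort appears and flags the ones that fall below a chosen share. This is a deliberately simple counting stand-in, not Interpret AI's method (which the post describes as foundation-model based); the function name, the `min_share` threshold, and the toy records are all assumptions made for the example.

```python
from collections import Counter


def find_underrepresented(records, keys, min_share=0.05):
    """Return cohorts whose share of the dataset falls below min_share."""
    # A cohort is the tuple of values a record has for the chosen keys.
    cohorts = Counter(tuple(record[k] for k in keys) for record in records)
    total = sum(cohorts.values())
    return {
        cohort: count / total
        for cohort, count in cohorts.items()
        if count / total < min_share
    }


# Toy, made-up records: 100 patients described by two attributes.
records = (
    [{"sex": "M", "age_band": "40-60"}] * 60
    + [{"sex": "F", "age_band": "60+"}] * 37
    + [{"sex": "F", "age_band": "40-60"}] * 3
)
gaps = find_underrepresented(records, keys=("sex", "age_band"))
print(gaps)  # only the ("F", "40-60") cohort is below the 5% threshold
```

Note that a count like this only surfaces cohorts that are present but rare; cohorts that are missing entirely need a reference list of expected groups to compare against.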

For safety

Suppose HealthCo has examples of people who have had heart attacks and wants to analyze the EHRs of other people who may be similarly at risk.

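Using Interpret AI, HealthCo can select one such person's example, search the pool of related people, and sort the results by confidence.
These people can then be flagged as at risk, quickly narrowing millions of records down to a few hundred at-risk people!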
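The "select an example, search for similar people, sort by confidence" workflow above can be sketched as a nearest-neighbor search over embedding vectors. This is only a minimal sketch under stated assumptions: it assumes each record has already been turned into a vector by some learned model, uses cosine similarity as the confidence score, and all names (`rank_similar`, `patient_a`, and so on) and the 3-dimensional toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_similar(query, records, top_k=3):
    """Return (record_id, score) pairs sorted by descending similarity."""
    scored = [(rid, cosine_similarity(query, emb)) for rid, emb in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings; real ones would come from a learned representation model.
records = {
    "patient_a": [0.9, 0.1, 0.0],
    "patient_b": [0.0, 1.0, 0.0],
    "patient_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # embedding of the known heart-attack example
for record_id, score in rank_similar(query, records, top_k=2):
    print(record_id, round(score, 3))
```

A linear scan like this is fine for a toy pool; at millions of records one would normally reach for an approximate nearest-neighbor index instead.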

Robotics

DriveCo is building autonomous race cars as toys for children to play with outside.

For training

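Interpret AI analyzes the collected video data from race car runs and provides a data report.
Interpret AI notices that most of the replays in the videos are not geographically diverse, and that there are very few examples of race cars driving outdoors in backyards.
Interpret AI recommends that the DriveCo team collect more examples of outdoor video. It also tries to mitigate the imbalance by balancing the dataset in a learned way using Interpret AI's foundation models.
- Without Interpret AI, DriveCo might have sent more than 1,000 hours of race car data off to have unnecessary objects labeled! Now it only needs to label 10 hours!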
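One simple way to picture "balancing the dataset" is inverse-frequency sampling: give each example a weight so that every scene category carries the same total probability mass. The post describes a learned approach built on foundation models; the sketch below swaps in this much simpler, non-learned technique purely for illustration, and the category names and counts are made up.

```python
from collections import Counter


def balancing_weights(labels):
    """Per-example sampling weights giving every category equal total mass."""
    counts = Counter(labels)
    n_categories = len(counts)
    # Each category gets 1 / n_categories of the mass, split evenly
    # across its examples, so rare categories get larger per-example weights.
    return [1.0 / (n_categories * counts[label]) for label in labels]


# Hypothetical clip labels: nine indoor runs for every backyard run.
labels = ["indoor"] * 9 + ["backyard"] * 1
weights = balancing_weights(labels)

backyard_mass = sum(w for w, label in zip(weights, labels) if label == "backyard")
print(round(backyard_mass, 3))  # 0.5
```

These weights can be passed to a sampler such as `random.choices(labels, weights=weights, k=...)` so that training batches see backyard clips about as often as indoor ones. Reweighting does not replace collecting more outdoor video, since the single backyard clip would simply be repeated.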

For safety

Suppose these autonomous race cars are facing scrutiny over infant safety.

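DriveCo can search its video database for footage containing "babies" to see whether it has this data.
If DriveCo does not have the data, this tells the team to go and collect it (hopefully using fake babies); if it does, DriveCo can show consumers and investors that the product is in fact safe around babies!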

How we got here

A brief history of labels and pretraining

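In 2015, before Transformers, most models were trained to solve a subset of very specific problems: classification, segmentation, object detection (i.e., foundational problems), and so on [1]. The benchmarks were "largish" labeled datasets, on the order of 10k to 1M examples. {1}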

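Modern pretraining arrived around 2017 and changed the game. Borrowing from representation learning, pretraining emerged as a fundamental paradigm shift, and suddenly unlabeled datasets brought large gains in model performance. The unlabeled datasets used for pretraining were massive compared to their labeled siblings [5]. Combined with other techniques and advances {2}, this led to modern foundation models such as CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].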

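Then OpenAI changed the game when it released GPT (Generative Pre-trained Transformer) [6], built on advances in Transformers, pretraining, and reinforcement learning. Sora [7], DeepSeek [8], and Anthropic [9] all use pretraining on large datasets as the backbone of their high-performing models. Hidden in there, however, is a sharp observation that most people don't talk about.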

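Pretraining is a good first step, but most of these models need further training on top of the pretrained base. Whether that is RL or supervised fine-tuning, the best-performing models are aligned {3} in some way to the original problem. However, fine-tuning also only scales up to a point, so improvements to pretraining are essential for the performance of future models {4}.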

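One of the most compelling examples in the literature of how to properly integrate pretraining and build a data flywheel is the labeled data flywheel Meta built for the Segment Anything Model (SAM) and SAM v2 [10]. Yet even in this example, data labeling is very hard to scale.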

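Segment Anything: the innovation and the message

Summary: what SAM shows is that quality assurance and understanding what is in your data are hard, but they are important problems to address. Adding more data is not necessarily the answer.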

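SAM built a data flywheel that curated a large labeled dataset by using partially trained versions of SAM at various stages of training, together with human label feedback. Their approach shows a proper way to integrate labeling into a pipeline, but it also highlights that even a proper data labeling flywheel is costly and hard to scale. At some point the dataset becomes large enough that humans can no longer annotate everything, and other introspection methods (i.e., what Interpret is building) become necessary.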

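Roughly, SAM's approach was as follows [10]:

1. Start from an MAE-pretrained hierarchical ViT.
2. Train SAM on publicly available segmentation datasets.
3. Use the partially trained SAM to generate segmentation masks on a data subset.
4. Have humans refine the segmentation predictions. Then use the masks to train an object detector to find more objects, and have humans label those.
5. Repeat steps 3-4, gradually increasing the size of the dataset.
6. Finish by running on 11 million images to obtain the roughly 1.1 billion masks of SA-1B. Use a QA team to flag potentially bad examples. Note that providing human labels for all 1.1 billion masks would be extremely difficult.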

The idea behind SAM 2 is the same. It is a video segmentation model, and it produced the SA-V dataset with 35.5 million masks across 50.9K videos, 53 times more masks than any existing video segmentation dataset [10].

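Note that the best segmentation model was trained on data directly relevant to the task, with label feedback all properly combined in a fast, efficient data flywheel. Pretraining and training on a collection of open-source segmentation datasets were only the first and second steps.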

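Also note that human labeling eventually hit a ceiling. By the time the data flywheel reached the scale of a billion masks, Meta still had to run QA filters to flag bad examples. Based on the paper, annotating all 1.1 billion masks would have taken about 51,000 days of annotation time! {5}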
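The 51,000-day figure is simple arithmetic, worked through below. The 4 seconds per mask is the optimistic assumption stated in footnote {5}, not a measured number, and the helper name is ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def annotation_days(num_masks, seconds_per_mask):
    """Total annotation time in days for a single annotator working nonstop."""
    return num_masks * seconds_per_mask / SECONDS_PER_DAY


# 1.1 billion masks at an assumed 4 seconds per mask (see footnote {5}).
days = annotation_days(1_100_000_000, 4)
print(round(days))  # 50926
```

That is roughly 140 years of continuous work for one annotator, before any review or rework is counted.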

And this is Meta; for most companies, hiring enough annotators to do that would be prohibitively expensive and infeasible! {6} Labeling at this scale is simply hard!

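To reiterate the summary: what SAM shows is that quality assurance and understanding what is in your data are hard, but they are important problems to address. This is the fundamental gap we see in industry today. More data for pretraining or fine-tuning is not necessarily the answer. The right approach is to identify where the model struggles, understand why it struggles there, and highlight the data (or data gaps) related to the problem. This is what we do at Interpret AI.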

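An annotation company's goals don't necessarily align with yours...

We have industry experience at MAANG, and our team has worked with annotation companies such as Scale and SuperAnnotate. For most labeling (annotation) companies, the business model looks like this: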

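1. They let the company produce its own labeling (annotation) spec, probably with some back-and-forth depending on the complexity of the labels.
2. Most annotation companies have different tiers of annotators. The largest pool is non-experts who label everything, and the smallest pool is domain experts (e.g., doctors). The annotation company then organizes a pool of human labelers, usually starting with the cheapest to do a low-quality first pass.
3. The annotators then label as best they can according to the company's complex annotation spec, and the annotation company charges per annotation.
4. The AI company provides feedback and updates on the annotations, and in some cases updates the annotation spec.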

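There are four main problems with this process:

1. Annotations are inconsistent and are usually not assigned to the right labelers.
2. Labeling is slow and expensive.
3. The feedback loop for fixing annotations is error-prone.
4. Annotation specs change over time as model performance changes.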

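Addressing 1: labelers are not necessarily suited to the labeling task they are assigned, and they often label differently from their peers. For a healthcare company, if the task is "select the clinical response that best diagnoses the patient", the labelers may not even be doctors qualified for the task! And for a self-driving company, if the task is "draw a bounding box around the stop sign", does that include the pole? What about the back of the stop sign? Different annotators will label these differently without consulting one another.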

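Addressing 2: charging per annotation sounds great in theory. The traditional dogma is that more labels help, but only if the company can afford enough labels to improve model performance, and that number is usually unknown. These annotations also usually contain errors, so the AI company has to build internal systems to review them, which costs time (on the order of months) and more money.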

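Addressing 3: the feedback loop is also inconsistent. The responsibility for validating annotations is usually pushed onto the AI company, which has to set up its own internal monitoring system (already slow and costly). When the AI company notices a problem with an annotation, the fix does not necessarily come from the annotator who created the problematic label, and the annotation company sometimes relabels the entire problematic example instead of fixing it, which costs more. For example, a self-driving company may want to label instance masks for traffic lights and people. In this dummy example, the first annotator makes a mistake and forgets to label the traffic lights that are not facing the camera. The AI company flags it and sends it back for re-review, but the way the annotation company fixes it is to send the image to a new annotator who relabels everything from scratch! The second annotator fixes the original problem but does not label the police officer as a "person", introducing a new problem! See Figures 3a and 3b. With this loop, the probability of annotating every object correctly across 50 labels is very low, about 61% {7}.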

Figure 3a: First pass by the first annotator who missed the traffic lights that are not facing the camera. (Image from Waymo Open Dataset [17])

Figure 3b: Second pass from the second annotator who got all the traffic lights but didn’t realize that the “people” class included police officers! (Image from Waymo Open Dataset [17])

Fundamentally, this feedback system gives no guarantee that the labels the annotation company produces will converge to the correct labels!
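The roughly 61% figure quoted above follows from footnote {7}: if each object is mislabeled independently with probability p = 0.01, the chance that all 50 objects in one example are correct is (1 - p)^50. A short check, using the footnote's assumed values (the function name is ours):

```python
def all_correct_probability(p_error, num_objects):
    """Probability that every object is labeled correctly, assuming independence."""
    return (1.0 - p_error) ** num_objects


# Footnote {7} assumptions: 1% per-object error rate, 50 objects per example.
prob = all_correct_probability(0.01, 50)
print(f"{prob:.3f}")  # 0.605
```

Because a from-scratch relabel redraws all 50 labels, every round of "fixes" faces the same odds again, which is why the loop is not guaranteed to converge.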

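The AI company's incentives are not well aligned with the labeling company's. The AI company wants to improve its AI models and products, while the annotation company wants to label, and bill for, as much of the company's data as possible. You want your model to perform well, and your annotation company should want that too.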

Addressing 4: in industry (and in research), there are many possible solutions when you are trying to solve a problem.

For instance, take labeling stop signs in autonomous driving. Suppose that we first label stop signs. We notice that performance improves when we know whether a stop sign is partially obstructed, so we later update the annotation spec to add a metadata tag called “obstructed” for signs that are partially or fully hidden. We then go back to the annotation company and ask them to relabel all our stop signs with this tag! This “annotation platform in the loop” means that every model experiment that updates the labeled dataset is super expensive!

So, one may wonder, why are labeling providers used at all? For two reasons. First, high-quality labels on data do help, as discussed earlier. In fact, less data with higher-quality labels can outperform some of these large pretrained models, SAM being an excellent example. Second, the alternative to using an annotation company is to build an internal annotation platform, which is even more expensive and time-consuming, since producing the same volume of labels as the other players can take years!

Conclusion

The optimal data flywheel represents data in a form that’s inherently insightful and interactable: we should be able to detect anomalies and also chat with our data to garner interesting patterns and insights. This flywheel should enhance annotation platforms by focusing on what should be labeled instead of labeling everything {8}. And finally, this data flywheel should align with model performance, tying directly to whatever problem your AI company is solving.

The traditional dogma is that more data “just works” and sometimes deep learning feels like alchemy. Perhaps more data will work for you in the short run but when things “just don’t work” the proper way is to assess failure both in the data & the model and work from there.

Over at Interpret we hope to change the paradigm. If you are interested, reach out to us at ily@interpretai.tech

Footnotes

{1} Back when AlexNet was still a thing, circa 2015, most computer vision models were trained on a subset of very particular problem types: classification, segmentation, object detection (i.e., foundational problems), and others like image captioning, scene recognition, and pose estimation (see the appendix for more details) [1]. Note this was pre “Attention Is All You Need”, when bigrams were à la mode. The focus then was model development while benchmarks remained fixed. These benchmarks were “largish” labeled datasets (on the order of 10k to 1M examples) that were used to evaluate model performance. Some of the popular CV benchmarks you’re probably familiar with are MNIST, ImageNet, MS COCO, KITTI, and Caltech-101 [2]. If you look at the largest labeled datasets around this time, they had around 1M labels, and that was considered large.
{2} Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift from learning features for only a specific labeled dataset to learning general features on unlabeled data that correlated well with other problems like classification, segmentation, and object detection. These datasets were massive compared to their labeled brethren [5]. At the same time, advancements in model training (CUDA optimization, which is why NVIDIA hit a 4T market cap), deep learning libraries (TensorFlow, PyTorch), and new or improved model architectures like Transformers from “Attention Is All You Need” opened up a brand new world. Researchers also noticed that increasing the size of models typically correlated with improved performance on unseen data (from the same data distribution). All of this combined with modern pretraining algorithms like pretext tasks, contrastive learning, masked language modeling, masked autoencoding (MAE), and multimodal modeling [4], unlocking the era of training big models on even more massive unlabeled datasets. Ergo, models like CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].
{3} “Alignment” is an overused term; I mean alignment in both the “we want our LLM to be helpful, not harmful” sense and the “data distribution alignment” sense.
{4} When training or fine-tuning a model, scaling model size correlates with improvement in performance, roughly following a power law. In industry, we’re already hitting the peak for model size scaling laws, and fine-tuning is giving less and less of an advantage. The next frontier is improving pretraining methods to better utilize existing unlabeled datasets.
{5} In the SAM paper, annotations could take 30 seconds (but suppose each took 4 seconds, based on the improvements from SAM v2 [10]); reviewing 1.1B masks would’ve required 1,100,000,000 * 4 seconds = 4.4 billion seconds, or ~51,000 days of annotation time!
{6} This is also assuming that the data distribution is stationary (unchanging). If we wanted to extend the labels to a different data distribution (say, deep sea diving videos, where the semantics and dynamics of objects are different), then fine-tuning SAM would still require the same data flywheel training process, which means more time and more money.
{7} Suppose that each object has a probability of being mislabeled of p = 0.01 (i.e., an annotator labels incorrectly or misses a label once every 100 labels). Assuming 50 objects in a video, the probability of getting all of them right, assuming independence, is (1 - p)^50 ≈ 0.61, a 61% chance of success! And that’s conservative.
{8} Fundamentally, when AI companies have better clarity on what to label, their incentives align with those of annotation companies.
{9} More and more, it is clear that very few samples (e.g., thousands) of very high-quality data are way better than millions of low-quality samples. This is particularly true in post-training of LLMs in industry, but it is also starting to be the focus of pretraining.
{10} A data flywheel is the loop used to collect data and improve the model, which makes a better product, which then modifies what data to collect, and the cycle repeats (for example, this image from dataloop.ai https://dataloop.ai/book/the-data-flywheel-effect/). A data engine is the infra for collecting/labeling/evaluating data (for example, Scale’s product https://scale.com/data-engine).

Special Thanks

Cameron Tukerman-Lee (also credit for the title)
Gabriele Sorrento
Francesco Pongetti
Lotfi Herzi

Appendix

[1] A more extensive list of popular foundational problems circa 2015 across different domains, so sort of pre-multimodal.
- Computer vision
  - classification
  - segmentation
  - object detection
  - image captioning
  - scene recognition
  - pose estimation
  - Optical Flow Estimation
  - Depth Estimation
  - Face recognition
  - Visual tracking
  - Style transfer
  - Image generation
- Natural Language Processing
  - Machine translation
  - Part of speech tagging
  - Question answering
- Speech Processing
  - Speech recognition
  - Speaker identification
  - Emotion classification
- Time series
- Reinforcement Learning
[2] Popular datasets separated by domain around 2015
Classification:
- ImageNet (ILSVRC 2017) - 1.2M training, 1000 classes - https://www.image-net.org/challenges/LSVRC/2017/index.php
- CIFAR-10/100 - 60K (32x32), 10/100 classes - https://www.cs.toronto.edu/~kriz/cifar.html
- MNIST - 70K handwritten digits - https://www.kaggle.com/datasets/hojjatk/mnist-dataset
- Fashion-MNIST - 70K fashion items - https://github.com/zalandoresearch/fashion-mnist
- SVHN - 600K real world house numbers 10 classes for each digit - http://ufldl.stanford.edu/housenumbers/
- Caltech-101/256 - 9K/30K images 101/256 categories - https://data.caltech.edu/records/mzrjq-6wc02, https://data.caltech.edu/records/nyy15-4j048
- Oxford Flowers 102 - 102 categories - https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
- Oxford-IIIT Pets - 7.4K images, 37 pet breeds - https://www.robots.ox.ac.uk/~vgg/data/pets/
- Stanford Cars - 16K images, 196 car models - https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset
- FGVC Aircraft - 10.2K images, 100 aircraft variants - https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
- Food-101 - 101 food categories - https://www.kaggle.com/datasets/dansbecker/food-101
- CUB-200-2011 - 12K bird images, 200 species - https://www.vision.caltech.edu/datasets/cub_200_2011/
- Stanford Dogs - 20K images, 120 dog breeds - http://vision.stanford.edu/aditya86/ImageNetDogs/
- MIT Indoor Scenes - 15K images, 67 indoor categories - http://web.mit.edu/torralba/www/indoor.html
Segmentation:
- PASCAL VOC 2012 - 11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
- MS COCO - 328K images, 80 object classes, 91 stuff categories, 5 captions per image, 250k people with keypoints - https://cocodataset.org/
- Cityscapes - 5K fine/25K coarse annotations, 8 classes - https://www.cityscapes-dataset.com/, https://www.cityscapes-dataset.com/dataset-overview/#class-definitions
- ADE20K - 25K images, 150 classes - https://groups.csail.mit.edu/vision/datasets/ADE20K/
- PASCAL Context - 10K images, 459 classes - https://cs.stanford.edu/~roozbeh/pascal-context/
- SBD (Semantic Boundaries) - 11K images from PASCAL - https://paperswithcode.com/dataset/sbd
- NYUDv2 - 1.4K RGB-D images - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
- SUN RGB-D - 10K RGB-D images - https://rgbd.cs.princeton.edu/
- KITTI Semantic - http://www.cvlibs.net/datasets/kitti/
Object Detection:
- PASCAL VOC 2012 - 10K/11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/
- MS COCO - 328K images, 80 classes, 1.5M instances - https://cocodataset.org/
- KITTI Object - http://www.cvlibs.net/datasets/kitti/
- Open Images (v1 in 2016) - 15.8 images, 6000 classes - https://storage.googleapis.com/openimages/web/index.html
- WIDER Face - 32K images, 393K face annotations - http://shuoyang1213.me/WIDERFACE/
Other Tasks:
Depth Estimation:
- NYUDv2 - 1.4K RGB-D scenes - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
- KITTI Depth - http://www.cvlibs.net/datasets/kitti/
- Make3D - 534 images with depths - http://make3d.cs.cornell.edu/data.html
Optical Flow:
- Sintel - http://sintel.is.tue.mpg.de/
- KITTI Flow - http://www.cvlibs.net/datasets/kitti/
- Flying Chairs - 22K synthetic pairs - https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html
- Middlebury - Small but precise benchmark - https://vision.middlebury.edu/flow/
Pose Estimation:
- MPII Human Pose - 25K images, 40K people - http://human-pose.mpi-inf.mpg.de/
- FLIC - 5003 images from movies - https://bensapp.github.io/flic-dataset.html
- Leeds Sports Pose - https://www.kaggle.com/datasets/dkrivosic/leeds-sports-pose-lsp
Face Recognition:
- LFW (Labeled Faces in the Wild) - 13K images, 5.7K people - https://www.kaggle.com/datasets/jessicali9530/lfw-dataset
- CelebA - 200K images, 10K identities - http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
- MegaFace - 1M images, 690K identities - http://megaface.cs.washington.edu/
- VGGFace - 2.6K people - https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
Video/Action Recognition:
- UCF-101 - 13,320 videos, 101 actions - https://www.crcv.ucf.edu/data/UCF101.php
- HMDB-51 - 6800 videos, 51 actions - https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
- Sports-1M - 1M YouTube videos, 487 sports - https://cs.stanford.edu/people/karpathy/deepvideo/
- ActivityNet - 20K videos, 200 classes - http://activity-net.org/
Attributes/Multi-label:
- WIDER Attribute - http://mmlab.ie.cuhk.edu.hk/projects/WIDERAttribute.html
- Berkeley Attributes - https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/poselets/
Reinforcement Learning: Can think of dataset size as number of rollouts.
- Classic control tasks
  - OpenAI Gym (CartPole, MountainCar, Acrobot, etc). I remember this before ChatGPT lol maybe I’m old
  - MuJoCo (Multi-joint dynamics with contact) like the halfcheetah, hopper, humanoid, etc. This was typically done in a physics simulation and was popular for PPO.
- Board games
  - Go
  - Chess
  - PyGame
- TORCS
- Minecraft
- ViZDoom
- Atari 2600 from DeepMind
[3] Scaling Laws Paper, Larger pretrained models paper
- "Scaling Laws for Neural Language Models" by Jared Kaplan et al. (2020): https://arxiv.org/abs/2001.08361
- "Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level": https://arxiv.org/abs/2105.06020
[4] Modern pretraining algorithms
Pretext Tasks:
- Rotation prediction
- Jigsaw puzzles
- Colorization
- Inpainting/Masked patches
Contrastive Learning Methods:
- SimCLR (Chen et al., 2020): "A Simple Framework for Contrastive Learning of Visual Representations" [2002.05709]
- MoCo v1 & v2 (He et al., 2019/2020): "Momentum Contrast for Unsupervised Visual Representation Learning"; "Improved Baselines with Momentum Contrastive Learning" [2003.04297]
- BYOL (Grill et al., 2020): "Bootstrap Your Own Latent"
- PIRL (Misra & van der Maaten, 2020): "Self-Supervised Learning of Pretext-Invariant Representations"
Masked Modeling:
- Masked Language Modeling (MLM): BERT (Devlin et al., 2018)
- Masked Autoencoder (MAE)
Multimodal Learning:
- CLIP (Radford et al., 2021): "Learning Transferable Visual Models From Natural Language Supervision" [2103.00020]
- ALIGN (Jia et al., 2021)
- DALL-E (Ramesh et al., 2021): "Zero-Shot Text-to-Image Generation"
[5] Pretraining datasets
- JFT-300M: Google’s internal 300M images, pseudo-labeled: https://ar5iv.labs.arxiv.org/html/1707.02968 (TO VERIFY)
- LAION-5B: 5.85 billion (image, text) pairs scraped from Common Crawl
- CLIP Training Data: 400M (image, text) pairs https://arxiv.org/abs/2103.00020 (not released)
- Wikipedia: English 20GB
- Kinetics-700: 650k videos (technically has action classes but still used)
[6] Improving Language Understanding by Generative Pre-Training https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[7] Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
[8] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism https://arxiv.org/abs/2401.02954
[9] Constitutional AI: Harmlessness from AI Feedback https://arxiv.org/abs/2212.08073
[10] Segment anything: https://arxiv.org/abs/2304.02643, SAM 2: Segment Anything In Images & Videos https://arxiv.org/pdf/2408.00714. More details below.
[11] https://techcrunch.com/2025/06/13/new-details-emerge-on-metas-14-3b-deal-for-scale/
[12] https://www.nature.com/articles/s41586-025-09227-0
[13] "Learning Transferable Visual Models From Natural Language Supervision" https://arxiv.org/abs/2103.00020
[14] "Zero-Shot Text-to-Image Generation" https://arxiv.org/abs/2102.12092
[15] "Emerging Properties in Self-Supervised Vision Transformers" https://arxiv.org/abs/2104.14294, "DINOv2: Learning Robust Visual Features without Supervision" https://arxiv.org/abs/2304.07193
[16] "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" https://arxiv.org/abs/1810.04805
[17] Waymo E2E Open dataset https://waymo.com/open/data/e2e#camera-data
