TL;DR

The prevailing dogma at AI companies is that more data leads to better performance, but in reality data size isn’t all you need. High-quality data yields better performance than a larger, low-quality dataset. Producing high-quality data requires filtering through noise, understanding unlabeled data, and understanding what to label. Labeling massive datasets through annotation platforms is also a problem: their incentives are often misaligned, and the platform becomes a bottleneck that is time-consuming, error-prone, and expensive. The best way to improve AI systems is to understand the data that feeds the models by representing datasets in a smart, interactable form using self-supervised representation learning, foundation modeling, and filtering. These practices guard against the risk of poor performance in AI systems and the risk of generating harmful outputs.

Less is more

Data size isn’t all you need. Blindly increasing dataset size during model pretraining puts companies that rely on AI at risk of making serious mistakes. Training models on large datasets with an unknown distribution leads to unexpected behaviors: in robotics this can mean wrong and dangerous trajectories, for a healthcare company incorrect risk assessments, and for large language models (LLMs) the generation of harmful speech {9}. On X, Grok made this mistake, generating harmful speech in the now-deleted post shown in Figure 0a. Even the xAI CEO admitted that they need to be more “selective about training data, rather than just training on the entire Internet”. But how do you select the right data to train and evaluate these models properly? What tools are available?

The solution is to represent data intelligently in a form that is interactable and sufficiently semantically diverse. This approach helps with: 1. creating training and evaluation datasets for both pretraining and post-training, 2. identifying gaps in the data, and 3. making recommendations on how to fill those gaps (either by purchasing or by collecting data).

Figure 0a: Examples of an LLM generating harmful speech likely due to existence of similar text in the training data the xAI team used to train Grok.

Figure 0b: Reaction from the xAI CEO after Grok generated harmful speech. The interesting piece is the team’s focus on being selective about the training data. Original post from the xAI CEO: https://x.com/elonmusk/status/1944132781745090819

Data flywheels {10} and annotation companies

In industry, most AI company CEOs, AI researchers, and engineers are dissatisfied with the modern annotation companies that embed themselves in their data flywheels.

The current solution AI companies resort to is to collect a large unlabeled dataset for pretraining (or use an open-source pretrained model), then label another large dataset specific to the intended task, and finally curate a training set and an evaluation set by hand. Labeling is typically outsourced to annotation companies (ScaleAI, SuperAnnotate, Labelbox, etc.) that embed themselves in the data engine. But labeling everything in a large dataset doesn’t work well: scaling data labeling to millions or billions of examples is error-prone, unsustainably expensive, and time-consuming, which leaves AI companies dissatisfied. More importantly, the labeling loop never ends, because data flywheels constantly adapt to evolving models and newly collected data, so labeling requirements keep changing over time; annotation companies can’t keep up with the pace of change, since model updates can happen in weeks while labeling can take months.

The modern labeling loop in a data engine is:

Collect some data.
Design or update a labeling spec.
Send the data and the spec to a labeling company (Scale, SuperAnnotate, etc.). Pay for the labels.
Iterate with the labeling company and train the model.
Observe the results, then repeat steps 2-5 indefinitely.

For example, a self-driving company may want to label stop signs, but after labeling a million stop signs and seeing the results they realize they want to label the “visibility” of each stop sign; then they realize they also want to label trees that may surround stop signs, adding an “obstructed” label. Now all of the data (which has also grown in the meantime, since data collection is ongoing) needs to be relabeled! The cycle will never end as long as the company keeps improving its model!

Meta spending $14.3 billion for a 49% stake to hire the CEO of Scale.AI [11] may be one of the riskiest moves the company has ever made, because of these difficulties with labeling companies.

So, if training blindly on massive datasets is a problem and labeling everything is hard, what else should we do? After working on this problem for the past four years, we’ve found that the best solution is to represent data well enough that it becomes easier to select and understand what’s in our data and how that data affects our models. We should be able to chat with our data in a way that lets us quickly search for examples and quickly build evaluation sets to test models.

This is what we’re building at Interpret AI. We’re building a data introspection platform, a data curation platform, and a smart data marketplace that let companies building AI systems interact with and understand their datasets. We envision a world where you can chat with your data using natural language, audio, image, and video to search for similar cases, so that companies can know and trust the data (or the gaps in the data) that power their models. (If any of this resonates with you, feel free to reach out to ily@interpretai.tech)

Measuring what is likely useful first

Traditional data flywheels

Figure 1a: The traditional data engine powering AI solutions in companies.

The company has some infrastructure that continuously collects data into a dataset (1b). The team then creates heuristic subsets of the data that, it is hoped, will improve their model once labeled (1a).
The data is sent to the labeling (annotation) company. The labeling company produces labels (annotations) that are then reviewed by the team, which can take months of back-and-forth to converge.
The AI model is then pretrained.
The pretrained model is then fine-tuned using the labels from the labeling company.
The final model is evaluated using the company’s evaluation system, which generates metrics.
The company then uses this feedback to select other subsets of data, update labeling requirements, and/or make changes to the model. Note that by this point the data subset has already started to go stale.

Note: the metrics may be skewed by poor annotations, which requires constant iteration from the team; this is expensive and time-inefficient (6).

Figure 1b: A breakdown of the time requirements for different processes in a traditional company’s approach to solutions. Notice that the major bottleneck is getting labels from a labeling company.

Figure 1b: Time constraints and setup of a traditional company’s AI system, with approximate timelines for iterating on each of these pieces independently. Note that with a labeling company in the loop, it takes months of iteration to generate labels that actually improve the AI model. See Figure 1a for how each of these pieces interacts within a traditional company.

Interpret AI’s data flywheel:

Start informed with deep data insights

Figure 2a: Interpret AI’s data flywheel & how we provide immediate data insights.

Figure 2a: Interpret AI’s data flywheel.

Immediate recommendations for data subsets and improved data suggestions for pretraining and training (1a and 1b, respectively).
The team now reviews much smaller subsets of data suggested by Interpret before sending them to a labeling company. These data subsets are dynamic and are continuously updated as the data changes. (Optionally, if a company integrates its foundation model, Interpret AI can provide further insights into how the data affects model performance.)
The back-and-forth with the labeling company is accelerated from months to weeks and is much cheaper, since the annotation spec and the dataset selection are clear.

Feedback focuses on the model (6).
Finally, Interpret AI analyzes your data space to provide insights into what data to collect or purchase to accelerate model improvement.

Figure 2b: A breakdown of the time requirements for different processes in using Interpret’s platform. On the left hand side feedback iteration speed in green is accelerated. Notice there is no more bottleneck.

Figure 2b: How Interpret AI integrates directly with our customers to accelerate model training, data triage and understanding, and evaluation. Interpret AI provides solutions for

Understanding the existing data distribution.
Identifying model gaps linked to data gaps.
Purchasing and curating data to fill data gaps.

Use cases

We collaborate with many companies across the robotics, healthcare, and agentic LLM industries. If any of this resonates with you, feel free to reach out to ily@interpretai.tech

Healthcare

HealthCo is trying to predict cardiovascular disease risk for its patients.

For training

Interpret AI analyzes the cardiovascular data using our foundation models, processing electronic health records (EHRs), images, and possibly ECG data [12] if available.
Interpret AI notices anomalies or “gaps” in HealthCo’s data and describes the demographics of these people (e.g. female, middle-aged, no children, historically prescribed trimetazidine).
These discovered records are further analyzed by experts. The identified data can then be updated, ignored, used to help purchase more data for people historically prescribed trimetazidine, or sent to a labeling company to annotate this specific cohort.
The selected data is then used to train the cardiovascular AI model. If HealthCo integrates its cardiovascular model into Interpret’s platform, we further analyze where the model underperforms in real time, allowing for immediate introspection.
This process reduces the model training timeline from the order of months to weeks, rapidly improving AI systems and saving costs!

For safety

Suppose HealthCo has examples of people who have had heart attacks and wants to analyze other electronic health records for similar people who may also be at risk.

Using Interpret AI, HealthCo can select examples of such a person and search for a related group of people, sorted by confidence.
These people can be flagged as at-risk, quickly identifying a few hundred at-risk people out of millions of records!
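To make the "select examples, search for similar people, sort by confidence" step concrete, here is a minimal sketch in Python. It assumes each record has already been turned into an embedding vector by some representation model; the toy vectors, the patient IDs, and the helper names `cosine_similarity` and `find_similar` are hypothetical, and cosine similarity to the mean of the query examples is just one simple choice of "confidence", not necessarily what Interpret AI uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_similar(query_embeddings, records, top_k=3):
    """Rank records by similarity to the mean of the query embeddings.

    `records` maps a record id to its embedding. Returns (id, score) pairs,
    highest score first.
    """
    dim = len(query_embeddings[0])
    centroid = [
        sum(q[i] for q in query_embeddings) / len(query_embeddings)
        for i in range(dim)
    ]
    scored = [(rid, cosine_similarity(centroid, emb)) for rid, emb in records.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy, made-up 3-dimensional embeddings (real embeddings would be much larger).
heart_attack_examples = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
records = {
    "patient_a": [0.85, 0.15, 0.05],
    "patient_b": [0.0, 1.0, 0.0],
    "patient_c": [0.1, 0.1, 0.9],
    "patient_d": [0.7, 0.3, 0.0],
}

ranked = find_similar(heart_attack_examples, records, top_k=2)
for record_id, score in ranked:
    print(f"{record_id}: {score:.3f}")
```

At the scale of millions of records, a brute-force scan like this would typically be replaced by an approximate nearest-neighbor index, but the ranking idea is the same.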

Robotics

DriveCo is building self-driving race cars as a toy for kids to play with outside.

For training

Interpret AI analyzes the collected runs of race car video data. Interpret AI provides a data report.
Interpret AI notices that the majority of the video replays are not geographically diverse and that there are few examples of race cars driving outdoors in backyards.
Interpret AI recommends that the DriveCo team collect more examples of outdoor videos. We also try to balance the dataset in a learned way using our Interpret AI foundation model to mitigate this imbalance.
- Without Interpret AI, DriveCo might have sent over 1,000 hours of race car data to label things that weren’t needed! Now they only need to label 10 hours!
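As a rough illustration of what rebalancing an imbalanced dataset can look like, the sketch below computes inverse-frequency sampling weights so that each group of clips is drawn equally often in expectation. This is a deliberately simple stand-in, not the learned balancing described above: the group labels ("indoor", "backyard") and the function name `balanced_sampling_weights` are hypothetical, and in practice the groups might come from clustering embeddings rather than from hand-written tags.

```python
import random
from collections import Counter


def balanced_sampling_weights(group_labels):
    """Return one sampling weight per example so every group has equal total weight.

    Each example in a group of size n gets weight 1 / (num_groups * n),
    so the weights sum to 1 and each group sums to 1 / num_groups.
    """
    counts = Counter(group_labels)
    num_groups = len(counts)
    return [1.0 / (num_groups * counts[g]) for g in group_labels]


# Toy dataset: 8 indoor clips and only 2 backyard clips.
clips = [f"clip_{i}" for i in range(10)]
labels = ["indoor"] * 8 + ["backyard"] * 2

weights = balanced_sampling_weights(labels)

# Draw a small training batch; backyard clips are now as likely as indoor ones overall.
rng = random.Random(0)
batch = rng.choices(clips, weights=weights, k=4)
print(batch)
```

Reweighting only changes how often existing clips are seen; it cannot create missing coverage, which is why the recommendation to collect more outdoor videos still matters.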

For safety

Suppose these self-driving race cars face scrutiny over infant safety.

DriveCo can search its database for videos containing a “baby” to see whether it has this data.
If DriveCo doesn’t have the data, this tells the team to collect it (using fake babies, I hope); if it does, DriveCo can show consumers and investors that the product is indeed safe around babies!

How we got here

A brief history of labels and pretraining

In 2015, before Transformers, most models were trained to solve a very specific subset of problems: classification, segmentation, object detection (i.e., foundation problems), and others [1]. Benchmarks were “largish” labeled datasets on the order of 10k to 1M examples. {1}

Modern pretraining entered the scene around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift in which unlabeled datasets suddenly unlocked massive gains in model performance. The unlabeled datasets used for pretraining were massive compared to their labeled counterparts [5]. This, along with other techniques and advancements {2}, led to modern foundation models such as CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16], to name a few.

Then OpenAI, building on a foundation of Transformers, pretraining, and advances in reinforcement learning, changed the game when it released GPT (Generative Pre-trained Transformer) [6]. Sora [7], DeepSeek [8], and Anthropic [9] all use pretraining on large datasets as the backbone of their high-performing models. But hidden in there is a sharp observation that most people don’t talk about.

While pretraining is a good first step, most of these models need further training on top of a pretrained base. Whether this is reinforcement learning or supervised fine-tuning, the most performant models are aligned {3} in some way to the original problem. But even fine-tuning only scales up to a point, which means that improving pretraining is necessary for future model performance {4}.

One of the most compelling examples in the literature of how to properly incorporate pretraining and build a data flywheel is the labeled data flywheel Meta built for the Segment Anything Model (SAM) and SAM v2 [10]. But even in this example, scaling data labeling is incredibly hard.

Segment Anything: innovations and the message

TL;DR: what SAM shows us is that quality assurance and understanding what’s in our data is hard but an important problem to address. Adding more data is not necessarily the answer.

SAM built a data flywheel that curated a large labeled dataset using a partially trained SAM at different stages of training together with human labeling feedback. Their approach demonstrates the right way to incorporate labeling into a pipeline, but it also highlights that even a proper labeling data flywheel is expensive and hard to scale. At some point the dataset grows large enough that humans can’t annotate everything, which calls for another method of introspection (i.e., what Interpret is building).

Roughly, SAM’s approach was [10]:

Start with an MAE-pretrained hierarchical ViT.
Train SAM on publicly available segmentation datasets.
Use the partially trained SAM to generate segmentation masks on a subset of the data.
Have humans refine the segmentation predictions. Then also use the masks to train an object detector to find more objects, and have humans label those.
Repeat steps 3-4 while gradually increasing the dataset size.
Finish by running on the full 11M-image collection to obtain the 1.1B masks of SA-1B. Use a QA team to flag bad examples.

The idea is the same for SAM 2, a video segmentation model, whose data engine generated the SA-V dataset with 35.5M masks across 50.9K videos, 53x more masks than any existing video segmentation dataset [10].

Notice that the best segmentation model was trained with data directly relating to its task, where the label feedback was all nicely coupled in a speedy, efficient data flywheel. Pretraining and then training with a collection of open source segmentation datasets were only the first and second steps.

Also notice that human labeling eventually hit a ceiling; when the data flywheel scaled up to the 1.1B masks of SA-1B, Meta still needed to run a QA filter to flag bad examples. Based on the paper, annotating all 1.1B masks would’ve taken 51k days of annotation time! {5}

This is Meta we’re talking about; for most companies, hiring for that much annotation would be egregiously expensive and infeasible! {6} Labeling at this scale is just hard!
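The 51k-day figure is easy to reproduce. The short Python sketch below redoes the arithmetic from footnote {5}, using the 4 seconds per mask assumed there and, for comparison, the 30 seconds per mask that the footnote also mentions; the function name `annotation_days` is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def annotation_days(num_masks, seconds_per_mask):
    """Total annotation time in days for one pass over all masks by a single annotator."""
    return num_masks * seconds_per_mask / SECONDS_PER_DAY


num_masks = 1_100_000_000  # 1.1B masks

days_fast = annotation_days(num_masks, 4)   # optimistic 4 s per mask
days_slow = annotation_days(num_masks, 30)  # 30 s per mask

print(f"4 s/mask:  {days_fast:,.0f} days (~{days_fast / 365:,.0f} annotator-years)")
print(f"30 s/mask: {days_slow:,.0f} days (~{days_slow / 365:,.0f} annotator-years)")
```

Even under the optimistic 4-second assumption, this is on the order of a hundred annotator-years of continuous, around-the-clock work for a single pass.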

Reiterating the TL;DR, what SAM shows us is quality assurance and understanding what’s in our data is hard but an important problem to address. This is fundamentally the gap we see in industry today: more data used for pretraining or finetuning is not necessarily the answer. The right approach identifies where a model suffers, understands why it suffers there, and then highlights data (or data gaps) relevant to the problem, which is what we’re doing over at Interpret AI.

Annotation companies’ goals are not necessarily aligned with yours…

We have industry experience at MAANG companies, and our team has experience working with annotation companies like Scale, SuperAnnotate, etc. For most labeling (annotation) companies, the business model is:

Let companies generate their own labeling (annotation) spec with perhaps some back & forth depending on the complexity of the labels.
Most annotation companies have different tiers of annotators, the largest pool being non-experts who label everything and the smallest being experts in the field (e.g. doctors). An annotation company then marshals a pool of human labelers, typically starting with the cheapest ones to do a low quality first pass.
The annotators then label according to the company’s complex annotation spec as best they can, charging per annotation.
Provide feedback and updates to the annotations, possibly updating the annotation spec.

There are four main problems with this process:

annotations are not consistent and are usually not assigned to the right labelers,
the labeling is time-consuming & expensive,
the feedback loop for correcting annotations is erroneous, and
annotation specs change over time as model performance changes.

Addressing 1., labelers are not guaranteed to be suited to their assigned labeling task and often label differently than their peers. For instance, for a healthcare company, if the task is “Pick the clinical response that best diagnoses the patient”, these labelers may not even be doctors suited to the task! Additionally, for an autonomous driving company, if the task is to “Draw bounding boxes for stop signs”, does this include the pole or not? What if it’s the back side of a stop sign? Different annotators will label differently without consulting each other.

Addressing 2., charging per annotation sounds great in theory, as the conventional dogma is that more labels help, but this holds only if the company can afford enough labels to boost model performance, a number that is typically unknown. These annotations will also typically contain errors, which forces AI companies to build internal systems to review the annotations, costing both time (on the order of months) and more money.

Addressing 3., the feedback loop is not consistent either. Typically the responsibility for annotation verification is pushed to the AI company, which needs to set up its own internal monitoring system (already time-consuming and costly). When an AI company notices an annotation issue, corrections are not guaranteed to come from the same annotator who created the problematic label, and sometimes annotation companies will relabel the entire problematic example instead of correcting it, which costs more. For instance, an autonomous driving company might want to label instance masks of traffic lights and people. In this dummy example, the first annotator makes a mistake and forgets to label traffic lights not facing the camera. The AI company flags it and sends it off to be re-reviewed, but the way the annotation company fixes this is by sending the image to a new annotator who relabels everything from scratch! The second annotator fixes the original issue but doesn’t label a police officer as “people”, and now a new issue emerges! See Figure 3a and Figure 3b. Under this loop, the probability of annotating every object in an example correctly is low: roughly 61% for 50 labels {7}.

Figure 3a: First pass by the first annotator who missed the traffic lights that are not facing the camera. (Image from Waymo Open Dataset [17])

Figure 3b: Second pass from the second annotator who got all the traffic lights but didn’t realize that the “people” class included police officers! (Image from Waymo Open Dataset [17])

Essentially, with this feedback system the labels an annotation company creates are not guaranteed to converge to the right labels!
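The ~61% figure quoted above for 50 labels follows directly from the independence assumption in footnote {7}. Here is a minimal Python sketch of that calculation; the 1% per-label error rate is the footnote’s assumption, and the larger label counts are added only to show how quickly the number falls.

```python
def prob_all_correct(p_error, num_labels):
    """Probability that every one of `num_labels` independent labels is correct."""
    return (1.0 - p_error) ** num_labels


p_error = 0.01  # one mistake per 100 labels, as assumed in footnote {7}

for n in (10, 50, 100, 200):
    print(f"{n:>3} labels: {prob_all_correct(p_error, n):.1%} chance of a fully correct example")

p50 = prob_all_correct(p_error, 50)
p200 = prob_all_correct(p_error, 200)
```

Because a from-scratch relabel redraws every label, each round faces the same odds again, which is why the loop is not guaranteed to converge.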

The incentives of AI companies are not well aligned with those of labeling companies. AI companies want to improve their AI model and their product while annotation companies want to label as much company data as possible so that they can charge for it. You want to make your model performant and so should annotation companies.

Addressing 4., in industry (and research), when trying to solve a problem, there are many possible solutions. Perhaps pretraining on the entire internet will improve your LLM, or perhaps grounding an LLM by training on labeled text-image pairs will help with LLM reasoning, or perhaps adding chain of thought will help. In other words, when designing AI systems we need to try a lot of different things in parallel, since sometimes it’s unclear what the best approach will be. Labeling is one solution, which means that as we better understand our problem the label definition is subject to change.

For instance, take labeling stop signs in autonomous driving; suppose that we first label stop signs. We notice that performance improves when we know if a stop sign is partially obstructed, so we later update the annotation spec to add a metadata tag called “obstructed” for when the sign is partially visible or not visible at all. We then go back to the annotation company and ask them to relabel all our stop signs with this tag! This “annotation platform in the loop” means that every model experiment that updates the labeled dataset is super expensive!

So, one may wonder, why are labeling providers used at all? For two reasons: First, high quality labels on data do help, as discussed earlier. In fact, less data with higher quality labels can outperform some of these large pretrained models, SAM being an excellent example. Second, the alternative to using an annotation company is to build an internal annotation platform, which is even more expensive and time consuming, since producing the same volume of labels as the other players can take years!

Conclusion

The optimal data flywheel represents data in a form that’s inherently insightful and interactable: we should be able to detect anomalies and also chat with our data to garner interesting patterns and insights. This flywheel should enhance annotation platforms by focusing on what should be labeled instead of labeling everything {8}. And finally, this data flywheel should align with model performance, tying directly to whatever problem your AI company is solving.

The traditional dogma is that more data “just works” and sometimes deep learning feels like alchemy. Perhaps more data will work for you in the short run but when things “just don’t work” the proper way is to assess failure both in the data & the model and work from there.

Over at Interpret we hope to change the paradigm. If you are interested, reach out to us at ily@interpretai.tech

Footnotes

Back when AlexNet was still a thing, circa 2015, most computer vision models were trained on a subset of very particular problem types: classification, segmentation, object detection (i.e., foundation problems) and others like image captioning, scene recognition, and pose estimation (see the appendix for more details) [1]. Note this was pre-“Attention Is All You Need”, when bigrams were à la mode. The focus then was model development while benchmarks remained fixed. These benchmarks were “largish” labeled datasets (order of 10k to 1M) that were used to evaluate model performance. Some of the popular CV benchmarks you’re probably familiar with are MNIST, ImageNet, MS COCO, KITTI, and Caltech-101 [2]. If you look at the largest labeled datasets around this time, they had around 1M labels, and that was considered large at the time.
Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift from learning features for only a specific labeled dataset to learning general features on unlabeled data that correlated well with other problems like classification, segmentation, and object detection. These datasets, compared to their labeled brethren, were massive [5]. At the same time, advancements in model training (CUDA optimization, which is why NVIDIA hit a $4T market cap), deep learning libraries (TensorFlow, PyTorch), and new or improved model architectures like Transformers from “Attention Is All You Need” opened up a brand new world. Researchers also noticed that increasing the size of models typically correlated with improved performance on unseen data (from the same data distribution). All of this, combined with modern pretraining algorithms such as pretext tasks, contrastive learning, masked language modeling, masked autoencoding (MAE), and multimodal modeling [4], unlocked the era of training big models on even more massive unlabeled datasets. Ergo, models like CLIP [13], DALL-E [14], DINOv2 [15], BERT [16].
“Alignment” is an overused term. I mean alignment in both the “we want our LLM to be helpful not harmful” sense and the “data distribution alignment” sense.
When training or fine-tuning a model, scaling model size correlates with improved performance, roughly following a power law. In industry, we’re already hitting the peak for model size scaling laws, and fine-tuning is giving less and less of an advantage. The next frontier is improving pretraining methods to better utilize existing unlabeled datasets.
In the SAM paper, annotations could take 30 seconds each (but suppose each took 4 seconds, based on the improvements from SAM v2 [10]); reviewing 1.1B masks would’ve required 1,100,000,000 * 4 seconds = 4.4 billion seconds, or ~51,000 days of annotation time!
This also assumes that the data distribution is stationary (unchanging). If we wanted to extend the labels to a different data distribution (say deep sea diving videos, where the semantics & dynamics of objects are different), then fine-tuning SAM would still require the same data flywheel training process, which means more time and more money.
Suppose that each object has a probability p = 0.01 of being mislabeled (i.e., an annotator labels incorrectly or misses a label once every 100 labels). Assuming 50 objects in a video and independence between labels, the probability that all of them are labeled correctly is (1 - p)^50 ≈ 61%! And that’s conservative.
Fundamentally, when AI companies have better clarity on what to label their incentives align with annotation companies.
More and more, it is clear that a few (e.g. thousands of) very high quality samples are far better than millions of low quality samples. This is particularly true in post-training of LLMs in industry, but it is also starting to be the focus of pre-training.
A data flywheel is the loop used to collect data, improve the model, which makes a better product, which then modifies what data to collect and the cycle repeats (for example this image from dataloop.ai https://dataloop.ai/book/the-data-flywheel-effect/). A data engine is the infra for collecting/labeling/evaluating data (for example Scale’s product https://scale.com/data-engine).

Special Thanks

Cameron Tukerman-Lee (also credit for the title)
Gabriele Sorrento
Francesco Pongetti
Lotfi Herzi

Appendix

[1] A more extensive list of popular foundational problems across different domains circa 2015 (i.e., roughly pre-multimodal).
- Computer vision
  - classification
  - segmentation
  - object detection
  - image captioning
  - scene recognition
  - pose estimation
  - Optical Flow Estimation
  - Depth Estimation
  - Face recognition
  - Visual tracking
  - Style transfer
  - Image generation
- Natural Language Processing
  - Machine translation
  - Part of speech tagging
  - Question answering
- Speech Processing
  - Speech recognition
  - Speaker identification
  - Emotion classification
- Time series
- Reinforcement Learning
[2] Popular datasets separated by domain around 2015
- Classification
  - ImageNet (ILSVRC 2017) - 1.2M training, 1000 classes - https://www.image-net.org/challenges/LSVRC/2017/index.php
  - CIFAR-10/100 - 60K (32x32), 10/100 classes - https://www.cs.toronto.edu/~kriz/cifar.html
  - MNIST - 70K handwritten digits - https://www.kaggle.com/datasets/hojjatk/mnist-dataset
  - Fashion-MNIST - 70K fashion items - https://github.com/zalandoresearch/fashion-mnist
  - SVHN - 600K real world house numbers, 10 classes (one per digit) - http://ufldl.stanford.edu/housenumbers/
  - Caltech-101/256 - 9K/30K images, 101/256 categories - https://data.caltech.edu/records/mzrjq-6wc02, https://data.caltech.edu/records/nyy15-4j048
  - Oxford Flowers 102 - 102 categories - https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
  - Oxford-IIIT Pets - 7.4K images, 37 pet breeds - https://www.robots.ox.ac.uk/~vgg/data/pets/
  - Stanford Cars - 16K images, 196 car models - https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset
  - FGVC Aircraft - 10.2K images, 100 aircraft variants - https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
  - Food-101 - 101 food categories - https://www.kaggle.com/datasets/dansbecker/food-101
  - CUB-200-2011 - 12K bird images, 200 species - https://www.vision.caltech.edu/datasets/cub_200_2011/
  - Stanford Dogs - 20K images, 120 dog breeds - http://vision.stanford.edu/aditya86/ImageNetDogs/
  - MIT Indoor Scenes - 15K images, 67 indoor categories - http://web.mit.edu/torralba/www/indoor.html
- Segmentation
  - PASCAL VOC 2012 - 11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
  - MS COCO - 328K images, 80 object classes, 91 stuff categories, 5 captions per image, 250k people with keypoints - https://cocodataset.org/
  - Cityscapes - 5K fine/25K coarse annotations, 8 classes - https://www.cityscapes-dataset.com/, https://www.cityscapes-dataset.com/dataset-overview/#class-definitions
  - ADE20K - 25K images, 150 classes - https://groups.csail.mit.edu/vision/datasets/ADE20K/
  - PASCAL Context - 10K images, 459 classes - https://cs.stanford.edu/~roozbeh/pascal-context/
  - SBD (Semantic Boundaries) - 11K images from PASCAL - https://paperswithcode.com/dataset/sbd
  - NYUDv2 - 1.4K RGB-D images - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
  - SUN RGB-D - 10K RGB-D images - https://rgbd.cs.princeton.edu/
  - KITTI Semantic - http://www.cvlibs.net/datasets/kitti/
- Object Detection
  - PASCAL VOC 2012 - 10K/11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/
  - MS COCO - 328K images, 80 classes, 1.5M instances - https://cocodataset.org/
  - KITTI Object - http://www.cvlibs.net/datasets/kitti/
  - Open Images (v1 in 2016) - 15.8 images, 6000 classes - https://storage.googleapis.com/openimages/web/index.html
  - WIDER Face - 32K images, 393K face annotations - http://shuoyang1213.me/WIDERFACE/
- Other Tasks
  - Depth Estimation
    - NYUDv2 - 1.4K RGB-D scenes - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
    - KITTI Depth - http://www.cvlibs.net/datasets/kitti/
    - Make3D - 534 images with depths - http://make3d.cs.cornell.edu/data.html
  - Optical Flow
    - Sintel - http://sintel.is.tue.mpg.de/
    - KITTI Flow - http://www.cvlibs.net/datasets/kitti/
    - Flying Chairs - 22K synthetic pairs - https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html
    - Middlebury - Small but precise benchmark - https://vision.middlebury.edu/flow/
  - Pose Estimation
    - MPII Human Pose - 25K images, 40K people - http://human-pose.mpi-inf.mpg.de/
    - FLIC - 5003 images from movies - https://bensapp.github.io/flic-dataset.html
    - Leeds Sports Pose - https://www.kaggle.com/datasets/dkrivosic/leeds-sports-pose-lsp
  - Face Recognition
    - LFW (Labeled Faces in the Wild) - 13K images, 5.7K people - https://www.kaggle.com/datasets/jessicali9530/lfw-dataset
    - CelebA - 200K images, 10K identities - http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
    - MegaFace - 1M images, 690K identities - http://megaface.cs.washington.edu/
    - VGGFace - 2.6K people - https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
  - Video/Action Recognition
    - UCF-101 - 13,320 videos, 101 actions - https://www.crcv.ucf.edu/data/UCF101.php
    - HMDB-51 - 6800 videos, 51 actions - https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
    - Sports-1M - 1M YouTube videos, 487 sports - https://cs.stanford.edu/people/karpathy/deepvideo/
    - ActivityNet - 20K videos, 200 classes - http://activity-net.org/
  - Attributes/Multi-label
    - WIDER Attribute - http://mmlab.ie.cuhk.edu.hk/projects/WIDERAttribute.html
    - Berkeley Attributes - https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/poselets/
- Reinforcement Learning (can think of dataset size as number of rollouts)
  - Classic control tasks
    - OpenAI Gym (cartpole, mountaincar, acrobot, etc). I remember this before chatgpt lol maybe I’m old
    - MuJoCo (Multi-joint dynamics with contact) like the halfcheetah, hopper, humanoid, etc. This was typically done in a physics simulation and was popular for PPO.
  - Board games
    - Go
    - Chess
    - PyGame
  - TORCS
  - Minecraft
  - ViZDoom
  - Atari 2600 from DeepMind
[3] Scaling Laws Paper, Larger pretrained models paper
- "Scaling Laws for Neural Language Models" by Jared Kaplan et al. (2020): https://arxiv.org/abs/2001.08361
- "Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level”: https://arxiv.org/abs/2105.06020
[4] Modern pretraining algorithms
- Pretext Tasks
  - Rotation prediction
  - Jigsaw puzzles
  - Colorization
  - Inpainting/Masked patches
- Contrastive Learning Methods
  - SimCLR (Chen et al., 2020): "A Simple Framework for Contrastive Learning of Visual Representations" (arXiv:2002.05709)
  - MoCo v1 & v2 (He et al., 2019/2020): "Momentum Contrast for Unsupervised Visual Representation Learning"; "Improved Baselines with Momentum Contrastive Learning" (arXiv:2003.04297)
  - BYOL (Grill et al., 2020): "Bootstrap Your Own Latent"
  - PIRL (Misra & van der Maaten, 2020): "Self-Supervised Learning of Pretext-Invariant Representations"
- Masked Modeling
  - Masked Language Modeling (MLM): BERT (Devlin et al., 2018)
  - Masked Autoencoder (MAE)
- Multimodal Learning
  - CLIP (Radford et al., 2021): "Learning Transferable Visual Models From Natural Language Supervision" (arXiv:2103.00020)
  - ALIGN (Jia et al., 2021)
  - DALL-E (Ramesh et al., 2021): "Zero-Shot Text-to-Image Generation"
[5] Pretraining datasets
- JFT-300M: Google’s internal dataset of 300M pseudo-labeled images: https://ar5iv.labs.arxiv.org/html/1707.02968
- LAION-5B: 5.85 billion (image, text) pairs scraped from Common Crawl
- CLIP Training Data: 400M (image, text) pairs https://arxiv.org/abs/2103.00020 (not released)
- Wikipedia: English 20GB
- Kinetics-700: 650k videos (technically has action classes but still used)
[6] Improving Language Understanding by Generative Pre-Training https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[7] Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
[8] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism https://arxiv.org/abs/2401.02954
[9] Constitutional AI: Harmlessness from AI Feedback https://arxiv.org/abs/2212.08073
[10] Segment anything: https://arxiv.org/abs/2304.02643, SAM 2: Segment Anything In Images & Videos https://arxiv.org/pdf/2408.00714. More details below.
[11] https://techcrunch.com/2025/06/13/new-details-emerge-on-metas-14-3b-deal-for-scale/
[12] https://www.nature.com/articles/s41586-025-09227-0
[13] "Learning Transferable Visual Models From Natural Language Supervision" https://arxiv.org/abs/2103.00020
[14] "Zero-Shot Text-to-Image Generation" https://arxiv.org/abs/2102.12092
[15] "Emerging Properties in Self-Supervised Vision Transformers" https://arxiv.org/abs/2104.14294, "DINOv2: Learning Robust Visual Features without Supervision" https://arxiv.org/abs/2304.07193
[16] "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" https://arxiv.org/abs/1810.04805
[17] Waymo E2E Open dataset https://waymo.com/open/data/e2e#camera-data