TL;DR

The dogma among AI companies is that more data leads to better performance, but in practice data scale is not all you need.

High-quality data yields better performance than a larger, lower-quality dataset.
Producing high-quality data requires filtering out noise, understanding unlabeled data, and understanding what to label.
Mass data labeling through annotation platforms is also problematic: their incentives are often misaligned with yours, and the platform itself is a bottleneck that is time-consuming, error-prone, and expensive.
The best way to improve AI systems is to understand the data that feeds the models by representing datasets intelligently and interactively, using self-supervised representation learning, foundation models, and filtering.
These methods mitigate the risk of poor performance in AI systems and the risk of generating harmful outputs.

Less is more

Data size is not all you need. Blindly scaling up the dataset while pretraining a model puts AI-focused companies at risk of making serious mistakes. Training models on large datasets with an unknown distribution leads to unpredictable behaviors: in robotics this can lead to wrong and dangerous trajectories, for a healthcare company to inaccurate risk assessments, and for LLMs to generating harmful speech ⁹. On X, Grok made exactly this mistake, generating harmful speech in a since-deleted post shown in Figure 0a. Even the CEO of xAI acknowledged that they need to be more “selective about training data, rather than just training on the entire Internet”. But how do you select data properly to train and evaluate these models? What tools exist?

The solution is to represent data intelligently, in a form that you can interact with and that is sufficiently diverse semantically. This approach helps to: 1. create training and evaluation datasets for both pretraining and post-training, 2. identify holes in the data, and 3. recommend how to fill those gaps (by buying or collecting data).

Figure 0a: Examples of an LLM generating harmful speech, likely due to the existence of similar text in the training data the xAI team used to train Grok.

Figure 0b: Reaction from the xAI CEO after Grok generated harmful speech. The interesting piece is the team’s focus on being selective about the training data. Original post from the xAI CEO: https://x.com/elonmusk/status/1944132781745090819

Data flywheels and annotation companies

In industry, most AI company CEOs, AI researchers, and engineers are dissatisfied with the modern annotation companies that plug into their data flywheels.

The common solution for AI companies today is to amass a large unlabeled dataset for pretraining (or to use an open-source pretrained model), then label another large dataset specific to the intended task, and finally hand-curate a training set and an evaluation set. Labeling is usually outsourced to annotation companies (ScaleAI, SuperAnnotate, Labelbox, etc.) that plug into the data engine. But labeling everything in a large dataset does not work well: scaling data labeling to millions or billions of examples is error-prone, unsustainably expensive, and time-consuming, which leaves AI companies dissatisfied. More importantly, the labeling loop never ends, because data flywheels continuously adapt to evolving models and to newly collected data, which makes labeling requirements fluid and changing over time. Annotation companies cannot keep up with the pace of change, since model updates can happen within weeks while labeling can take months.

The modern labeling loop in a data engine is:

1. Collect data.
2. Design or update a labeling spec.
3. Send the data and the spec to a labeling company (Scale, SuperAnnotate, etc.) and pay for the labels.
4. Iterate with the labeling company and train the model.
5. Look at the results, then repeat steps 2-5 indefinitely.

For example, an autonomous driving company may want to label stop signs, but after labeling a million stop signs and seeing the results they realize they want to label the “visibility” of each stop sign, and then they realize they also want to label trees that may surround stop signs and add an “occluded” label. Now all the data (which has also grown in the meantime, because data collection is continuous) needs to be relabeled! The cycle never ends as long as a company keeps improving its model!

Meta’s $14.3 billion investment for a 49% stake, made in order to hire the CEO of Scale AI [11], may be one of the riskiest moves the company has ever made because of these difficulties with labeling companies.

So, if blindly training on huge datasets is problematic, and labeling everything is hard, what else should we do? After working on this topic for the past four years, we have found that the best solution is to represent data well enough that it becomes easier to select data and to understand what is in our data and how it affects our models. We need to be able to chat with our data in a way that lets us quickly search for examples and quickly build evaluation sets to test models.

This is what we are building at Interpret AI. We are building a data introspection platform, a data curation platform, and a smart data marketplace that lets companies building AI systems interact with and understand their datasets. We envision a world where you can chat with your data using natural language, audio, image, and video to search for similar instances, so that companies can know and trust the data (or the gaps in the data) that drives their models.

First scale what is likely to help

Traditional data flywheels

Figure 1a: The traditional data engine powering AI solutions in companies.

1. A company has some infrastructure that continuously collects data into a dataset (1b). A team then creates heuristic data subsets that, once labeled, will hopefully improve their model (1a).
2. The data is sent to the labeling (annotation) company. The labeling company produces labels (annotations), which are then reviewed by the team; this can take months of back-and-forth to converge.
3. The AI model is then pretrained.
4. The pretrained model is then fine-tuned using the labels from the labeling company.
5. The final model is evaluated using the company’s evaluation system, producing metrics.
6. The company then uses this feedback to perhaps select other data subsets, update the labeling requirements, and/or make model changes. Note that by this point the data subset is already going stale.

Note: metrics may be skewed by poor annotations, which demand constant iteration from the team that is both expensive and time-inefficient (6).

Figure 1b: A breakdown of the time requirements for different processes in a traditional company’s approach to solutions. Notice that the major bottleneck is getting labels from a labeling company.

Figure 1b: Time constraints and setup of a traditional company’s AI system, with estimated timelines for iterating on each of these parts independently. Note that with a labeling company in the loop, it takes months of iteration to produce labels that properly improve an AI model. See Figure 1a for how each of these parts interacts within a traditional company.

Interpret AI’s data flywheel: start knowing with deep data insights

Figure 2a: Interpret AI’s data flywheel and how we provide immediate data insights.

Figure 2a: Interpret AI’s data flywheel.

Immediate recommendations on data subsets and improved data suggestions for pretraining and training (1a and 1b respectively).
The team now reviews significantly smaller data subsets suggested by Interpret before sending them to a labeling company. These data subsets are fluid and continuously updated as the data changes (optionally, if a company integrates its base model, Interpret AI can provide more insight into how the data affects model performance).
The back-and-forth with a labeling company is accelerated from months to weeks and is significantly cheaper, because the annotation specs and the dataset selection are clear.

Feedback is model-focused (6).
Finally, Interpret AI analyzes your data space to provide insights into which data to collect or buy in order to accelerate model improvement.

Figure 2b: A breakdown of the time requirements for different processes in using Interpret’s platform. On the left hand side feedback iteration speed in green is accelerated. Notice there is no more bottleneck.

Figure 2b: The figure shows how Interpret AI integrates directly with our customers to accelerate model training, data triage and understanding, and evaluation. Interpret AI provides solutions for:

Understanding the existing data distribution.
Identifying model gaps that correlate with data gaps.
Buying and curating data to fill data gaps.

Use cases

We collaborate with several businesses in the robotics, healthcare, and agentic LLM industries:

Healthcare

HealthCo is trying to predict cardiovascular disease risk for its patients.

For training

Interpret AI analyzes cardiovascular data using our Interpret foundation models, processing EHRs, images, and possibly ECG data [12] if available.

Interpret AI notices anomalies or “holes” in HealthCo’s data and describes the demographics of the affected people (e.g., women, middle-aged, without children, historically prescribed trimetazidine).
The identified records are analyzed further by experts. The selected data can then be updated, ignored, used to help acquire more data on people who were historically prescribed trimetazidine, or sent to a labeling company to annotate this specific group.
The selected data is then used to train an AI model for cardiovascular disease. If HealthCo integrates its cardiovascular model into the Interpret platform, we further analyze where the model performs poorly in real time, enabling immediate introspection.
This process cuts the model training timeline from the order of months to weeks, rapidly improving AI systems and saving costs!
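One simple way to picture what flagging “holes” could look like is to score each record by how far it sits from its nearest neighbors in an embedding space. The sketch below is only an illustration under stated assumptions: it is not Interpret AI’s method, the 2-D embeddings and record ids are made up, and `knn_distance` and `sparse_points` are hypothetical helper names.

```python
import math

def knn_distance(point, others, k=2):
    """Mean Euclidean distance from `point` to its k nearest neighbors."""
    dists = sorted(math.dist(point, other) for other in others)
    return sum(dists[:k]) / k

def sparse_points(embeddings, k=2, factor=2.0):
    """Flag ids whose k-NN distance exceeds `factor` times the median k-NN distance."""
    scores = {}
    for rid, emb in embeddings.items():
        others = [e for oid, e in embeddings.items() if oid != rid]
        scores[rid] = knn_distance(emb, others, k)
    ordered = sorted(scores.values())
    median = ordered[len(ordered) // 2]
    return sorted(rid for rid, score in scores.items() if score > factor * median)

# Made-up 2-D embeddings: four records cluster together, one sits far away.
embeddings = {
    "r1": (0.0, 0.0), "r2": (0.1, 0.0), "r3": (0.0, 0.1),
    "r4": (0.1, 0.1), "r5": (3.0, 3.0),
}
flagged = sparse_points(embeddings)
print(flagged)
```

Records flagged this way would still need expert review, as described above; distance in an embedding space only suggests where the data is thin.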
For safety

Suppose HealthCo has examples of people who suffered heart attacks and wants to analyze the EHRs of other, similar people who may also be at risk.

Using Interpret AI, HealthCo can select examples of such a person and search for a related pool of people, sorted by confidence level.

These people can be flagged as at-risk, quickly identifying a few hundred at-risk people out of millions of records!
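The search described above can be sketched as a nearest-neighbor lookup over embeddings, ranked by a similarity score. This is a minimal sketch, assuming each record already has an embedding vector; the patient ids, the 3-D vectors, and the `rank_similar` helper are hypothetical, and cosine similarity stands in for whatever confidence score a real system would use.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_similar(query, records, top_k=3):
    """Return the top_k (record_id, score) pairs, most similar first."""
    scored = [(rid, cosine_similarity(query, emb)) for rid, emb in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Made-up embeddings; real ones would come from a model that encodes EHRs.
records = {
    "patient_a": [0.9, 0.1, 0.0],
    "patient_b": [0.0, 1.0, 0.0],
    "patient_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # embedding of a known heart-attack case (made up)
ranked = rank_similar(query, records, top_k=2)
print(ranked)
```

At the scale of millions of records, an approximate nearest-neighbor index would likely replace the exhaustive loop, but the ranking idea stays the same.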
Robotics

DriveCo builds autonomous race cars as a toy for kids to play with outside.

For training

Interpret AI analyzes the collected runs of race car video data. Interpret AI provides a data report.

Interpret AI notices that most of the video replays are not geographically diverse and that there are few examples of race cars driving outside in backyards.
Interpret AI recommends that the DriveCo team collect more examples of outdoor videos. We also try to balance the dataset in a learned manner using Interpret AI’s foundation model to mitigate this imbalance.
Without Interpret AI, DriveCo might have sent over 1000 hours of race car data out for object labeling that was never needed! Now they only need to label 10 hours!
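To make the imbalance idea concrete, here is a small count-based sketch. It is a plain stand-in, not the learned balancing mentioned above: it assumes each clip already carries a scene tag, and the tag names, the clip counts, and the 10% threshold are all made up for illustration.

```python
from collections import Counter

def underrepresented(tags, min_share=0.10):
    """Return tags whose share of the dataset falls below `min_share`."""
    counts = Counter(tags)
    total = sum(counts.values())
    return sorted(tag for tag, count in counts.items() if count / total < min_share)

def inverse_frequency_weights(tags):
    """Per-tag sampling weights proportional to 1 / frequency (rarer tags weigh more)."""
    counts = Counter(tags)
    total = len(tags)
    num_tags = len(counts)
    return {tag: total / (num_tags * count) for tag, count in counts.items()}

# Hypothetical scene tags for 100 race car clips.
clip_tags = ["indoor_track"] * 90 + ["driveway"] * 8 + ["backyard"] * 2

rare = underrepresented(clip_tags)
weights = inverse_frequency_weights(clip_tags)
print(rare)
print(weights)
```

Reweighting only stretches the examples that already exist; when a scene type is nearly absent, collecting more of it, as recommended above, is the more reliable fix.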
For safety

Suppose these autonomous race cars face criticism over baby safety.

DriveCo can search its database for videos containing “baby” to see whether it has this data.

If DriveCo does not have the data, this tells the team to collect it (hopefully using fake babies); if it does, DriveCo can show consumers and investors that the product is in fact safe around babies!

How we got here

A brief history of labels and pretraining

In 2015, before Transformers, most models were trained to solve a very particular subset of problems: classification, segmentation, object detection (i.e., foundational problems), and others [1]. Benchmarks were “largish” labeled datasets on the order of 10k to 1M examples. ¹

Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining arrived as a fundamental paradigm shift in which unlabeled datasets suddenly unlocked huge gains in model performance. The unlabeled datasets used for pretraining were massive compared to their labeled brethren [5]. This, combined with other techniques and advances ², led to modern foundation models such as CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16], to name just a few.

Then OpenAI, building on transformers, pretraining, and advances in reinforcement learning, changed the game when it released GPT (generative pre-trained transformer) [6]. Sora [7], DeepSeek [8], and Anthropic [9] all use pretraining on large datasets as the backbone of their performant models. But hidden in there is a sharp observation that most people do not talk about.

While pretraining is a good first step, most of these models need additional training on top of a pretrained base. Whether it is RL or supervised fine-tuning, the most performant models are aligned ³ in some way to the original problem. But even fine-tuning only scales up to a point, which means improving pretraining is essential for future model performance ⁴.

One of the most compelling examples in the literature of how to properly incorporate pretraining and build a data flywheel is the labeled data flywheel built by Meta for the Segment Anything Model (SAM) and SAM 2 [10]. But even in this example, data labeling is incredibly hard to scale.

Segment Anything: the innovations and the message

TL;DR: What SAM shows us is that ensuring quality and understanding what is in our data is a hard but important problem to tackle. Adding more data is not necessarily the answer.

SAM built a data flywheel that curated a large labeled dataset using a partially trained SAM at various stages of training together with human labeling feedback. Their approach illustrates the right way to integrate labeling into a pipeline, but it also highlights that even a well-built labeling flywheel is expensive and hard to scale. At some point the dataset grows large enough that humans cannot annotate everything, so a different method of introspection is required (i.e., what Interpret is building).

Roughly, SAM’s approach was [10]:

1. Start with an MAE-pretrained hierarchical ViT.
2. Train SAM on publicly available segmentation datasets.
3. Use the partially trained SAM to generate segmentation masks on a data subset.
4. Ask humans to refine the segmentation predictions. Then also use the masks to train an object detector to find more objects, and ask humans to label those.
5. Repeat steps 3-4 while gradually increasing the size of the dataset.
6. Finish by running on 11M images to obtain the 1.1B masks of SA-1B. Use a QA team to flag potentially bad examples. Note that providing human labels for all 1.1 billion masks is incredibly hard.

The idea is the same for SAM 2, a video segmentation model, which produced the SA-V dataset with 35.5M masks across 50.9K videos, 53 times more masks than any existing video segmentation dataset [10].

Note that the best segmentation model was trained with data directly tied to its task, with the label feedback neatly wired into a fast and efficient data flywheel. Pretraining and then training on a collection of open-source segmentation datasets were only the first and second steps.

Note also that human labeling eventually hit a ceiling: when the data flywheel scaled up to a billion masks, Meta still had to run a QA filter to flag bad examples. Based on the paper, annotating all 1.1B masks would have taken about 51k days of annotation time! ⁵
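The 51k-day figure is plain arithmetic, and a short sketch makes it easy to rerun with other assumptions. The 1.1B masks and the 4 seconds per mask come from the footnote; the helper name `annotation_days` is ours, and the calculation assumes a single annotator working around the clock.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def annotation_days(num_masks: int, seconds_per_mask: float) -> float:
    """Total annotation time in days for one annotator working nonstop."""
    return num_masks * seconds_per_mask / SECONDS_PER_DAY

days = annotation_days(1_100_000_000, 4)  # figures taken from the footnote
years = days / 365
print(f"{days:,.0f} days, about {years:.0f} years")
```

Spreading the work across many annotators shortens the calendar time but not the total person-days that have to be paid for.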

This is Meta we are talking about; for most companies, hiring people to do this would be incredibly expensive and infeasible! ⁶ Labeling at this scale is simply hard!

To repeat the TL;DR, what SAM shows us is that ensuring quality and understanding what is in our data is a hard but important problem to tackle. This is essentially the gap we see in industry today: more data used for pretraining or fine-tuning is not necessarily the answer. The right approach identifies where a model struggles, understands why it struggles there, and then highlights data (or data gaps) relevant to the problem, which is what we do at Interpret AI.

The goals of annotation companies are not necessarily aligned with yours...

We have industry experience at MAANG, and our team has worked with annotation companies such as Scale, SuperAnnotate, etc. For most labeling (annotation) companies, the business model is:

Let companies produce their labeling (annotation) spec, perhaps with some back-and-forth depending on the complexity of the labels.

Most annotation companies have different tiers of annotators; the largest pool is non-experts who label everything, and the smallest is domain experts (e.g., doctors). The annotation company then recruits a pool of human labelers, usually starting with the cheapest to do a low-quality first pass.
The annotators then label according to the company’s complex annotation spec as best they can, charging per annotation.
Provide feedback and updates to the annotations, perhaps updating the annotation spec.
There are four main problems with this process:

1. annotations are inconsistent and usually not assigned to the right labelers,
2. the labeling is time-consuming and expensive,
3. the feedback loop for correcting annotations is erroneous, and
4. annotation specs change over time as model performance changes.

Addressing 1., labelers are not guaranteed to be suited to their assigned labeling task and often label differently than their peers. For instance, for a healthcare company, if the task is “Pick the clinical response that best diagnoses the patient”, these labelers may not even be doctors suited to the task! Additionally, for an autonomous driving company, if the task is to “Draw bounding boxes for stop signs”, does this include the pole or not? What if it’s the back side of a stop sign? Different annotators will label differently without consulting each other.

Addressing 2., charging per annotation sounds great in theory, since the conventional dogma is that more labels help, but it only works if the company can afford enough labels to boost model performance, a number that is typically unknown. These annotations will also typically have errors that require AI companies to build internal systems to review the annotations, which takes both time (order of months) and more money.

Addressing 3., the feedback loop is not consistent either. Typically the responsibility of annotation verification is pushed to the AI company, which needs to set up its own internal monitoring system (already time-consuming and costly). When an AI company notices an annotation issue, corrections are not guaranteed to come from the same annotator who created the problematic label, and sometimes annotation companies will relabel the entire problematic example instead of correcting it, which costs more. For instance, an autonomous driving company might want to label instance masks of traffic lights and people. In this dummy example, the first annotator makes a mistake and forgets to label traffic lights not facing the camera. The AI company flags it and sends it off to be re-reviewed, but the way the annotation company fixes this is by sending the image to a new annotator who relabels everything from scratch! The second annotator fixes the original issue but doesn’t label police officers as “people”, and now a new issue emerges! See Figure 3a and Figure 3b. This loop has a low probability of annotating every object correctly: ~61% for 50 labels ⁷.
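The ~61% figure follows from multiplying independent per-label success rates, and is easy to check. The sketch below uses the numbers from the footnote (a 1% error rate per label and 50 labels) and assumes independence between labels; the function name `prob_all_correct` is ours.

```python
def prob_all_correct(p_error: float, num_labels: int) -> float:
    """Probability that every one of `num_labels` independent labels is correct."""
    return (1.0 - p_error) ** num_labels

p_all = prob_all_correct(0.01, 50)
print(f"{p_all:.3f}")

# The same error rate hurts more as scenes get denser.
for n in (10, 50, 200):
    print(n, round(prob_all_correct(0.01, n), 3))
```

Because the probability decays exponentially in the number of labels, relabeling an example from scratch restarts this gamble every time instead of keeping the labels that were already right.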

Figure 3a: First pass by the first annotator who missed the traffic lights that are not facing the camera. (Image from Waymo Open Dataset [17])

Figure 3b: Second pass from the second annotator who got all the traffic lights but didn’t realize that the “people” class included police officers! (Image from Waymo Open Dataset [17])

Essentially, with this feedback system the labels an annotation company creates are not guaranteed to converge to the right labels!

The incentives of AI companies are not well aligned with those of labeling companies. AI companies want to improve their AI model and their product while annotation companies want to label as much company data as possible so that they can charge for it. You want to make your model performant and so should annotation companies.

Addressing 4., in industry (and research), when trying to solve a problem, there are many possible solutions. Perhaps pretraining on the entire internet will improve your LLM, or perhaps grounding an LLM by training on labeled text-image pairs will help with LLM reasoning, or perhaps adding chain of thought will help. In other words, when designing AI systems we need to try a lot of different things in parallel, since sometimes it’s unclear what the best approach will be. Labeling is one solution, which means that as we better understand our problem the label definition is subject to change.

For instance, take labeling stop signs in autonomous driving; suppose that we first label stop signs. We notice that performance improves when we know if a stop sign is partially obstructed, so we later update the annotation spec to add a metadata tag called “obstructed” for when the sign is partially or fully hidden. We then go back to an annotation company and ask them to relabel all our stop signs with this! This “annotation-platform in the loop” means that every model experiment that updates the labeled dataset is super expensive!

So, one may wonder, why are labeling providers used at all? For two reasons. First, high-quality labels on data do help, as discussed earlier. In fact, less data with higher-quality labels can outperform some of these large pretrained models, SAM being an excellent example. Second, the alternative to using an annotation company is to create an internal annotation platform, which is even more expensive and time-consuming, since producing the same volume of labels as the other players can take years!

Conclusion

The optimal data flywheel represents data in a form that’s inherently insightful and interactable: we should be able to detect anomalies and also chat with our data to garner interesting patterns and insights. This flywheel should enhance annotation platforms by focusing on what should be labeled instead of labeling everything ⁸. And finally, this data flywheel should align with model performance, tying directly to whatever problem your AI company is solving.

The traditional dogma is that more data “just works” and sometimes deep learning feels like alchemy. Perhaps more data will work for you in the short run but when things “just don’t work” the proper way is to assess failure both in the data & the model and work from there.

Footnotes

1. Back when AlexNet was still a thing, circa 2015, most models for computer vision were trained on a subset of very particular problem types: classification, segmentation, object detection (i.e., foundational problems), and others like image captioning, scene recognition, and pose estimation (see appendix for more details) [1]. Note this was pre “Attention Is All You Need”, when bigrams were à la mode. The focus then was model development while benchmarks remained fixed. These benchmarks were “largish” labeled datasets (order of 10k to 1M) that were used to evaluate model performance. Some of the popular CV benchmarks you’re probably familiar with are MNIST, ImageNet, MS COCO, KITTI, and Caltech-101 [2]. If you look at the largest labeled datasets around this time, they had around 1M labels, and that was considered large at the time.
2. Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift from learning features for only a specific labeled dataset to learning general features on unlabeled data that correlated well with other problems like classification, segmentation, and object detection. These datasets were massive compared to their labeled brethren [5]. At the same time, advancements in model training (CUDA optimization, which is why NVIDIA hit a 4T market cap), deep learning libraries (TensorFlow, PyTorch), and new or improved model architectures like Transformers from “Attention Is All You Need” opened up a brand new world. Researchers also noticed that increasing the size of models typically correlated with improved performance on unseen data (from the same data distribution). All of this combined with modern pretraining algorithms like pretext tasks, contrastive learning, masked language modeling, masked autoencoding (MAE), and multimodal modeling [4], unlocking the era of training big models on massive unlabeled datasets. Ergo, models like CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].
3. “Alignment” is an overused term; I mean alignment in both the “we want our LLM to be helpful, not harmful” sense and the “data distribution alignment” sense.
4. When training or fine-tuning a model, scaling model size correlates with improvement in performance, roughly following a power law. In industry, we’re already hitting the peak for model size scaling laws, and fine-tuning is giving less and less of an advantage. The next frontier is improving pretraining methods to better utilize existing unlabeled datasets.
5. In the SAM paper, annotations could take 30 seconds (but suppose it took 4 seconds based on the improvements from SAM v2 [10]); reviewing 1.1B masks would’ve required 1,100,000,000 * 4 seconds = 4.4 billion seconds, or ~51,000 days of annotation time!
6. This is also assuming that the data distribution is stationary (unchanging). If we wanted to extend the labels to a different data distribution (say deep sea diving videos, where the semantics and dynamics of objects are different), then fine-tuning SAM would still require the same data flywheel training process, which means more time and more money.
7. Suppose that each object has a probability of being mislabeled p=0.01 (i.e., an annotator labels incorrectly or misses a label once every 100 labels). Assuming 50 objects in a video, the probability of succeeding, assuming independence, is (1 - p)^50 = 0.99^50, or about a 61% chance of success! And that’s conservative.
8. Fundamentally, when AI companies have better clarity on what to label, their incentives align with those of annotation companies.
9. More and more it is clear that very few samples (e.g., thousands) of very high-quality data are far better than millions of low-quality samples. This is particularly true in post-training of LLMs in industry, but it is starting to be the focus of pre-training as well.
10. A data flywheel is the loop used to collect data and improve the model, which makes a better product, which then modifies what data to collect, and the cycle repeats (for example, this image from dataloop.ai https://dataloop.ai/book/the-data-flywheel-effect/). A data engine is the infra for collecting/labeling/evaluating data (for example Scale’s product https://scale.com/data-engine).

Special Thanks

Cameron Tukerman-Lee (also credit for the title)
Gabriele Sorrento
Francesco Pongetti
Lotfi Herzi

Appendix

[1] A more extensive list of popular 2015 foundational problems across different domains, so sort of pre-multimodal.
- Computer vision
  - classification
  - segmentation
  - object detection
  - image captioning
  - scene recognition
  - pose estimation
  - Optical Flow Estimation
  - Depth Estimation
  - Face recognition
  - Visual tracking
  - Style transfer
  - Image generation
- Natural Language Processing
  - Machine translation
  - Part of speech tagging
  - Question answering
- Speech Processing
  - Speech recognition
  - Speaker identification
  - Emotion classification
- Time series
- Reinforcement Learning
[2] Popular datasets around 2015, separated by domain
- Classification
  - ImageNet (ILSVRC 2017) - 1.2M training, 1000 classes - https://www.image-net.org/challenges/LSVRC/2017/index.php
  - CIFAR-10/100 - 60K (32x32), 10/100 classes - https://www.cs.toronto.edu/~kriz/cifar.html
  - MNIST - 70K handwritten digits - https://www.kaggle.com/datasets/hojjatk/mnist-dataset
  - Fashion-MNIST - 70K fashion items - https://github.com/zalandoresearch/fashion-mnist
  - SVHN - 600K real world house numbers 10 classes for each digit - http://ufldl.stanford.edu/housenumbers/
  - Caltech-101/256 - 9K/30K images 101/256 categories - https://data.caltech.edu/records/mzrjq-6wc02, https://data.caltech.edu/records/nyy15-4j048
  - Oxford Flowers 102 - 102 categories - https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
  - Oxford-IIIT Pets - 7.4K images, 37 pet breeds - https://www.robots.ox.ac.uk/~vgg/data/pets/
  - Stanford Cars - 16K images, 196 car models - https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset
  - FGVC Aircraft - 10.2K images, 100 aircraft variants - https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
  - Food-101 - 101 food categories - https://www.kaggle.com/datasets/dansbecker/food-101
  - CUB-200-2011 - 12K bird images, 200 species - https://www.vision.caltech.edu/datasets/cub_200_2011/
  - Stanford Dogs - 20K images, 120 dog breeds - http://vision.stanford.edu/aditya86/ImageNetDogs/
  - MIT Indoor Scenes - 15K images, 67 indoor categories - http://web.mit.edu/torralba/www/indoor.html
- Segmentation
  - PASCAL VOC 2012 - 11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
  - MS COCO - 328K images, 80 object classes, 91 stuff categories, 5 captions per image, 250k people with keypoints - https://cocodataset.org/
  - Cityscapes - 5K fine/25K coarse annotations, 8 classes - https://www.cityscapes-dataset.com/, https://www.cityscapes-dataset.com/dataset-overview/#class-definitions
  - ADE20K - 25K images, 150 classes - https://groups.csail.mit.edu/vision/datasets/ADE20K/
  - PASCAL Context - 10K images, 459 classes - https://cs.stanford.edu/~roozbeh/pascal-context/
  - SBD (Semantic Boundaries) - 11K images from PASCAL - https://paperswithcode.com/dataset/sbd
  - NYUDv2 - 1.4K RGB-D images - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
  - SUN RGB-D - 10K RGB-D images - https://rgbd.cs.princeton.edu/
  - KITTI Semantic - http://www.cvlibs.net/datasets/kitti/
- Object Detection
  - PASCAL VOC 2012 - 10K/11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/
  - MS COCO - 328K images, 80 classes, 1.5M instances - https://cocodataset.org/
  - KITTI Object - http://www.cvlibs.net/datasets/kitti/
  - Open Images (v1 in 2016) - 15.8 images, 6000 classes - https://storage.googleapis.com/openimages/web/index.html
  - WIDER Face - 32K images, 393K face annotations - http://shuoyang1213.me/WIDERFACE/
- Other Tasks
  - Depth Estimation
    - NYUDv2 - 1.4K RGB-D scenes - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
    - KITTI Depth - http://www.cvlibs.net/datasets/kitti/
    - Make3D - 534 images with depths - http://make3d.cs.cornell.edu/data.html
  - Optical Flow
    - Sintel - http://sintel.is.tue.mpg.de/
    - KITTI Flow - http://www.cvlibs.net/datasets/kitti/
    - Flying Chairs - 22K synthetic pairs - https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html
    - Middlebury - Small but precise benchmark - https://vision.middlebury.edu/flow/
  - Pose Estimation
    - MPII Human Pose - 25K images, 40K people - http://human-pose.mpi-inf.mpg.de/
    - FLIC - 5003 images from movies - https://bensapp.github.io/flic-dataset.html
    - Leeds Sports Pose - https://www.kaggle.com/datasets/dkrivosic/leeds-sports-pose-lsp
  - Face Recognition
    - LFW (Labeled Faces in the Wild) - 13K images, 5.7K people - https://www.kaggle.com/datasets/jessicali9530/lfw-dataset
    - CelebA - 200K images, 10K identities - http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
    - MegaFace - 1M images, 690K identities - http://megaface.cs.washington.edu/
    - VGGFace - 2.6K people - https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
  - Video/Action Recognition
    - UCF-101 - 13,320 videos, 101 actions - https://www.crcv.ucf.edu/data/UCF101.php
    - HMDB-51 - 6800 videos, 51 actions - https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
    - Sports-1M - 1M YouTube videos, 487 sports - https://cs.stanford.edu/people/karpathy/deepvideo/
    - ActivityNet - 20K videos, 200 classes - http://activity-net.org/
  - Attributes/Multi-label
    - WIDER Attribute - http://mmlab.ie.cuhk.edu.hk/projects/WIDERAttribute.html
    - Berkeley Attributes - https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/poselets/
  - Reinforcement Learning (can think of dataset size as number of rollouts)
    - Classic control tasks
      - OpenAI Gym (cartpole, mountaincar, acrobot, etc). I remember this before chatgpt lol maybe I’m old
      - MuJoCo (Multi-joint dynamics with contact) like the halfcheetah, hopper, humanoid, etc. This was typically done in a physics simulation and was popular for PPO.
    - Board games
      - Go
      - Chess
      - PyGame
    - TORCS
    - Minecraft
    - ViZDoom
    - Atari 2600 from DeepMind
[3] Scaling Laws Paper, Larger pretrained models paper
- "Scaling Laws for Neural Language Models" by Jared Kaplan et al. (2020): https://arxiv.org/abs/2001.08361
- "Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level”: https://arxiv.org/abs/2105.06020
[4] Modern pretraining algorithms
- Pretext Tasks
  - Rotation prediction
  - Jigsaw puzzles
  - Colorization
  - Inpainting/Masked patches
- Contrastive Learning Methods
  - SimCLR (Chen et al., 2020): "A Simple Framework for Contrastive Learning of Visual Representations" [2002.05709]
  - MoCo v1 & v2 (He et al., 2019/2020): "Momentum Contrast for Unsupervised Visual Representation Learning"; "Improved Baselines with Momentum Contrastive Learning" [2003.04297]
  - BYOL (Grill et al., 2020): "Bootstrap Your Own Latent"
  - PIRL (Misra & van der Maaten, 2020): "Self-Supervised Learning of Pretext-Invariant Representations"
- Masked Modeling
  - Masked Language Modeling (MLM): BERT (Devlin et al., 2018)
  - Masked Autoencoder (MAE)
- Multimodal Learning
  - CLIP (Radford et al., 2021): "Learning Transferable Visual Models From Natural Language Supervision" [2103.00020]
  - ALIGN (Jia et al., 2021)
  - DALL-E (Ramesh et al., 2021): "Zero-Shot Text-to-Image Generation"
[5] Pretraining datasets
- JFT-300M: Google’s internal dataset of 300M pseudo-labeled images: https://ar5iv.labs.arxiv.org/html/1707.02968
- LAION-5B: 5.85 billion (image, text) pairs scraped from Common Crawl
- CLIP Training Data: 400M (image, text) pairs https://arxiv.org/abs/2103.00020 (not released)
- Wikipedia: English 20GB
- Kinetics-700: 650k videos (technically has action classes but still used)
[6] Improving Language Understanding by Generative Pre-Training https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[7] Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
[8] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism https://arxiv.org/abs/2401.02954
[9] Constitutional AI: Harmlessness from AI Feedback https://arxiv.org/abs/2212.08073
[10] Segment anything: https://arxiv.org/abs/2304.02643, SAM 2: Segment Anything In Images & Videos https://arxiv.org/pdf/2408.00714. More details below.
[11] https://techcrunch.com/2025/06/13/new-details-emerge-on-metas-14-3b-deal-for-scale/
[12] https://www.nature.com/articles/s41586-025-09227-0
[13] "Learning Transferable Visual Models From Natural Language Supervision” https://arxiv.org/abs/2103.00020
[14] "Zero-Shot Text-to-Image Generation” https://arxiv.org/abs/2102.12092
[15] "Emerging Properties in Self-Supervised Vision Transformers” https://arxiv.org/abs/2104.14294, "DINOv2: Learning Robust Visual Features without Supervision” https://arxiv.org/abs/2304.07193
[16] “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” https://arxiv.org/abs/1810.04805
[17] Waymo E2E Open dataset https://waymo.com/open/data/e2e#camera-data

Data scale is NOT all you need