Executive Summary

The dogma for AI companies is that more data leads to better performance, but data scale is not all you need. High-quality data yields better performance than a larger, lower-quality dataset. Producing high-quality data requires sifting through noise, understanding unlabeled data, and understanding what to label. Massive data labeling by annotation platforms is also problematic because their incentives are often misaligned and their platform is a bottleneck that is time-consuming, error-prone, and expensive. The best way to improve AI systems is to understand the data feeding the models by intelligently representing datasets in a form that can be interacted with, using self-supervised representation learning, foundation models, and filtering. These methods reduce the risk of poor performance in AI systems and the risk of generating harmful outputs.

Less is more

Data scale is not all you need. Blindly increasing dataset size while pretraining a model puts AI-first companies at risk of making serious mistakes. Training models on large datasets with an unknown distribution leads to unpredictable behavior: in robotics it can lead to wrong and dangerous trajectories, for a healthcare company to inaccurate risk assessments, and for LLMs to generating harmful speech {9}. On X, Grok made this mistake, generating harmful speech in a now-deleted post shown in Figure 0a. Even the xAI CEO admitted that they need to be more “selective about training data, rather than just training on the entire Internet”. But how do we properly select data to train and evaluate these models? What tools exist?

The solution is to intelligently represent data in a form that can be interacted with and that is sufficiently semantically diverse. This approach helps to: 1. create training and evaluation datasets for both pretraining and post-training, 2. identify holes in the data, and 3. recommend how to fill those gaps (by buying or collecting data).

Figure 0a: Examples of an LLM generating harmful speech likely due to existence of similar text in the training data the xAI team used to train Grok.

Figure 0b: Reaction from the xAI CEO after Grok generated harmful speech. The interesting piece is the team’s focus on being selective about the training data. Original post from the xAI CEO https://x.com/elonmusk/status/1944132781745090819

Data flywheels {10} and annotation companies

In industry, most AI company CEOs, AI researchers, and engineers are unhappy with modern annotation companies that integrate themselves into their data flywheels.

The current solution for AI companies is to amass a large unlabeled dataset for pretraining (or use an open-source pretrained model), then label another large dataset specific to the intended task, and finally manually curate a training set and an evaluation set. Labeling is usually outsourced to annotation companies (ScaleAI, SuperAnnotate, Labelbox, etc.) that integrate themselves into the data engine. But labeling everything in a large dataset does not work well: scaling data labeling to millions or billions of examples is error-prone, unsustainably expensive, and time-consuming, and it leaves AI companies unhappy. More importantly, the labeling loop never ends, because data flywheels continuously adapt to evolving models and to newly collected data, which makes labeling requirements fluid over time. Annotation companies cannot keep up with the pace of change because model updates can happen within weeks while labeling can take months.

The modern labeling loop in a data engine is:

1. Collect some data.
2. Design or update a labeling spec.
3. Send the data and the spec to a labeling company (Scale, SuperAnnotate, etc.) and pay for the labeling.
4. Iterate with the labeling company and train the model.
5. Look at the results, then repeat steps 2-5 indefinitely.

For example, an autonomous driving company may want to label stop signs, but after labeling a million stop signs and seeing the results they realize they want to label the “visibility” of each stop sign. Then they realize they also want to label trees that may surround stop signs and add an “occluded” label. Now all the data (which has also grown in the meantime, since data collection is continuous) needs to be relabeled! The cycle never ends as long as a company keeps improving its model!
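To make the cost of this cycle concrete, here is a minimal Python sketch. The starting size of one million stop signs comes from the example above; the 20% growth between revisions and the count of three spec revisions (stop sign, visibility, occluded) are made-up assumptions for illustration, and `relabel_volume` is a hypothetical helper rather than part of any product.

```python
def relabel_volume(initial_items: int, growth_percent: int, revisions: int) -> list[int]:
    """Items that must be (re)labeled at each revision of the labeling spec.

    Assumes every spec revision forces a full relabel of the dataset and that
    the dataset grows by `growth_percent` between consecutive revisions.
    """
    volumes = []
    size = initial_items
    for _ in range(revisions):
        volumes.append(size)
        # Integer arithmetic keeps the toy numbers exact.
        size = size * (100 + growth_percent) // 100
    return volumes


# 1M stop signs is from the example; growth and revision count are assumptions.
volumes = relabel_volume(initial_items=1_000_000, growth_percent=20, revisions=3)
total_labeled = sum(volumes)

print("labeled per revision:", volumes)
print("total labels paid for:", total_labeled)
```

Under these assumptions the dataset ends at 1.44M stop signs, yet the company has paid for 3.64M labels, because each spec change is billed against the whole (growing) dataset.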

Meta’s $14.3 billion investment for a 49% stake to hire Scale.AI’s CEO [11] may be one of the riskiest moves the company has ever made because of these difficulties with labeling companies.

So, if blindly training on huge datasets is problematic, and labeling everything is hard, what else should we do? After working on this problem for the past four years, we have found that the best solution is to represent data well enough that it becomes easier to select and understand what is in our data and how that data affects our models. We need to be able to chat with our data in a way that lets us quickly search for examples and quickly build evaluation sets to test models.

This is what we are building at Interpret AI. We are building a data introspection platform, a data curation platform, and an intelligent data marketplace that lets companies building AI systems interact with and understand their datasets. We envision a world where you can chat with your data using natural language, audio, image, and video to search for similar instances, so that companies can trust and know the data (or the gaps in their data) that powers their models. (If any of this resonates with you, please feel free to reach out at ily@interpretai.tech)

Scale what is likely useful first

Traditional data flywheels

Figure 1a: The traditional data engine powering AI solutions in companies.

A company has some infrastructure that continually collects data into a dataset (1b). A team then creates heuristic data subsets that will hopefully improve their model once labeled (1a).
The data is sent to the labeling (annotation) company. The labeling company produces labels (annotations) that are then reviewed by the team, which can take months of back and forth to converge.
The AI model is then pretrained.
The pretrained model is then fine-tuned using the labels from the labeling company.
The final model is evaluated using the company’s evaluation system, producing metrics.
The company then uses this feedback to perhaps select other data subsets, update the labeling requirements, and/or make model changes. Note that by this point the data subset is already becoming stale.

Note: metrics may be biased by poor annotations, which requires constant iteration from the team that is both expensive and time-inefficient (6).

Figure 1b: A breakdown of the time requirements for different processes in a traditional company’s approach to solutions. Notice that the major bottleneck is getting labels from a labeling company.

Figure 1b: Time constraints and setup of a traditional company’s AI system, with estimated timelines for iterating on each of these parts independently. Note that with a labeling company in the loop, it takes months of iteration to create labels that properly improve an AI model. See Figure 1a for how each of these parts interacts in a traditional company.

Interpret AI’s data flywheel:

Start knowing with deep data insights

Figure 2a: Interpret’s AI data flywheel & how we provide immediate data insights.

Figure 2a: Interpret AI’s data flywheel.

Immediate recommendations for data subsets and improved data suggestions for pretraining and training (1a and 1b respectively).
The team now reviews significantly smaller data subsets suggested by Interpret before sending them to a labeling company. These data subsets are fluid and continuously updated as the data changes (optionally, if a company integrates its foundation model, Interpret AI can provide further insights into how the data affects model performance).
The back and forth with a labeling company is accelerated from months to weeks and is significantly cheaper because the annotation specs and dataset selection are clear.

The feedback is model-focused (6).
Finally, Interpret AI analyzes your data space to provide insights into what data to collect or buy to accelerate model improvement.

Figure 2b: A breakdown of the time requirements for different processes in using Interpret’s platform. On the left hand side feedback iteration speed in green is accelerated. Notice there is no more bottleneck.

Figure 2b: The figure demonstrates how Interpret AI integrates directly with our customers to accelerate model training, data triage and understanding, and evaluation. Interpret AI provides solutions for:

Understanding the existing data distribution.
Identifying model gaps that correlate with data gaps.
Buying and curating data to fill data gaps.

Use cases

We are partnering with several businesses in the robotics, healthcare, and agentic LLM industries. If any of these resonate with you, please feel free to reach out at ily@interpretai.tech

Healthcare

HealthCo is trying to predict the risk of cardiovascular disease for its patients.

For training

Interpret AI analyzes cardiovascular data using our Interpret foundation models, processing EHRs, images, and possibly ECG data [12] if available.
Interpret AI notices anomalies or “holes” in HealthCo’s data and describes the demographics of these people (e.g. middle-aged women without children who were historically prescribed trimetazidine).
These identified records are further analyzed by experts. The selected data can then be updated, ignored, used to help purchase more data on people who were historically prescribed trimetazidine, or sent to a labeling company to annotate this specific cohort.
The selected data is then used to train an AI model for cardiovascular disease. If HealthCo integrates its cardiovascular model into the Interpret platform, we further analyze where the model performs poorly in real time, enabling immediate introspection.
This process shrinks the model training timeline from the order of months to weeks, quickly improving AI systems and saving costs!

For safety

Suppose HealthCo has examples of people who have suffered heart attacks and wants to analyze the EHRs of other people who are similar to them and who may also be at risk.

Using Interpret AI, HealthCo can select examples of such a person and search for a related pool of people, sorted by confidence level.
These people can be flagged as at-risk, quickly identifying a few hundred at-risk people out of millions of records!
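The search described above can be sketched as a nearest-neighbor lookup over record embeddings. This is only an illustrative sketch in plain Python, not Interpret AI’s implementation: the three-dimensional vectors, the patient IDs, the 0.9 threshold, and the helper names `cosine_similarity` and `find_similar` are all hypothetical. In practice the embeddings would come from a foundation model over the EHRs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_similar(query: list[float], records: dict[str, list[float]], threshold: float):
    """Return (record_id, score) pairs with score >= threshold, best match first."""
    scored = [(rid, cosine_similarity(query, emb)) for rid, emb in records.items()]
    hits = [(rid, score) for rid, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (made up for illustration).
records = {
    "patient_a": [0.9, 0.1, 0.0],
    "patient_b": [0.0, 1.0, 0.0],
    "patient_c": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # embedding of a known heart-attack patient (hypothetical)

at_risk = find_similar(query, records, threshold=0.9)
for record_id, score in at_risk:
    print(f"{record_id}: similarity {score:.3f}")
```

Sorting by similarity score plays the role of the “confidence level” in the text: reviewers can start from the closest matches and stop once the scores drop below a threshold they trust.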

Robotics

DriveCo builds autonomous race cars as a toy for children to play with outside.

For training

Interpret AI analyzes the collected runs of race car video data. Interpret AI provides a data report.
Interpret AI notices that most of the video replays are not geographically diverse and that there are few examples of race cars driving outdoors in backyards.
Interpret AI recommends that the DriveCo team collect more examples of outdoor videos. We also try to balance the dataset in a learned way using Interpret AI’s foundation model to mitigate this imbalance.
- Without Interpret AI, DriveCo might have sent over 1000 hours of race car data out for object labeling that was not needed! Now they only need to label 10 hours!
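A data report like the one above boils down to measuring how the dataset is distributed and flagging thin regions. The following is a deliberately simple sketch under stated assumptions: each clip already carries a scene tag (in practice this might come from clustering embeddings), the tag names and counts are invented, and `underrepresented` is a hypothetical helper.

```python
from collections import Counter


def underrepresented(tags: list[str], min_share: float) -> dict[str, float]:
    """Map each tag whose share of the dataset is below `min_share` to that share."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items() if n / total < min_share}


# Hypothetical scene tags for 100 collected clips.
clip_tags = ["indoor_track"] * 90 + ["driveway"] * 8 + ["backyard"] * 2

gaps = underrepresented(clip_tags, min_share=0.05)
for tag, share in gaps.items():
    print(f"collect more '{tag}' clips: only {share:.0%} of the dataset")
```

With these toy numbers only the backyard scenes fall below the 5% floor, which is the kind of signal that would tell a team what to collect next instead of labeling everything already on disk.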

For safety

Suppose these autonomous race cars face criticism over infant safety.

DriveCo can search its database for videos containing “baby” to see whether it has this data.
If DriveCo does not have the data, this tells the team to collect it (hopefully using fake babies). If it does, DriveCo can show consumers and investors that the product is in fact safe around babies!

How we got here

A brief history of labels and pretraining

In 2015, before Transformers, most models were trained to solve a very particular subset of problems: classification, segmentation, object detection (i.e. foundational problems), and others [1]. Benchmarks were “largish” labeled datasets on the order of 10k to 1M examples. {1}

Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift in which unlabeled datasets suddenly unlocked huge gains in model performance. The unlabeled datasets used for pretraining were massive compared to their labeled brethren [5]. This, combined with other techniques and advancements {2}, led to modern foundation models like CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16], to name a few.

Then OpenAI, building on a foundation of transformers, pretraining, and advances in reinforcement learning, changed the game when they released GPT (generative pre-trained transformer) [6]. Sora [7], DeepSeek [8], and Anthropic [9] all use pretraining on large datasets as the backbone of their performant models. But hidden in there is a keen observation that most people don’t talk about.

While pretraining is a good first step, most of these models need further training on top of a pretrained base. Whether it is RL or supervised fine-tuning, the most performant models are aligned {3} in some way to the original problem. But even fine-tuning only scales up to a point, which means improving pretraining is essential for future model performance {4}.

One of the most compelling examples in the literature of how to properly integrate pretraining and build a data flywheel is the labeled data flywheel built by Meta for the Segment Anything Model (SAM) and SAM v2 [10]. But even in this example, data labeling is incredibly hard to scale.

Segment Anything: the innovations and the message

Executive summary: What SAM shows us is that quality assurance and understanding what is in our data is a hard but important problem to address. Adding more data is not necessarily the answer.

The SAM team built a data flywheel that curated a large labeled dataset using a partially trained SAM at different stages of training, with human label feedback. Their approach illustrates the right way to integrate labeling into a pipeline, but it also highlights that even the right data labeling flywheel is expensive and challenging to scale. At some point the dataset grows large enough that humans cannot annotate everything, which calls for another method of introspection (i.e. what Interpret is building).

Roughly, SAM’s approach was [10]:

1. Start with an MAE-pretrained hierarchical ViT.
2. Train SAM on publicly available segmentation datasets.
3. Use the partially trained SAM to generate segmentation masks on a data subset.
4. Have humans refine the segmentation predictions. Then also use the masks to train an object detector to find more objects, and have humans label those.
5. Repeat steps 3-4 while gradually increasing the dataset size.
6. Finish by running the model over the full image set to obtain SA-1B (11M images, 1.1B masks). Use a QA team to flag potentially bad examples. Note that providing human labels for all 1.1 billion masks is incredibly hard.

The idea is the same for SAM 2, which is a video segmentation model and produced the SA-V dataset with 35.5M masks across 50.9K videos, 53 times more masks than any existing video segmentation dataset [10].

Notice that the best segmentation model was trained with data directly related to its task, where the label feedback was all nicely connected in a fast and efficient data flywheel. Pretraining and then training with a collection of open-source segmentation datasets were only the first and second steps.

Notice also that human labeling eventually hit a ceiling; when the data flywheel scaled to over a billion masks, Meta still had to run a QA filter to flag bad examples. Based on the paper, annotating all 1.1B masks would have taken 51k days of annotation! {5}
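The 51k-day figure can be reproduced with a few lines of Python. The 4 seconds per mask is the assumption stated in footnote {5}, not a measured value, and the result is continuous annotation time for a single annotator; `annotation_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def annotation_days(num_masks: int, seconds_per_mask: float) -> float:
    """Continuous annotation time, in days, for one annotator working non-stop."""
    return num_masks * seconds_per_mask / SECONDS_PER_DAY


# 1.1B masks is the SA-1B size; 4 s per mask is the footnote's assumption.
days = annotation_days(num_masks=1_100_000_000, seconds_per_mask=4)

print(f"{days:,.0f} days of annotation")
print(f"about {days / 365:.0f} years for a single annotator")
```

Even split across a thousand annotators working around the clock, this is still roughly 51 days of pure labeling time before any review or rework.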

This is Meta we are talking about; hiring for this would be prohibitively expensive and infeasible for most companies! {6} Labeling at this scale is simply hard!

Reiterating the executive summary, what SAM shows us is that quality assurance and understanding what is in our data is a hard but important problem to address. This is essentially the gap we see in industry today: more data used for pretraining or fine-tuning is not necessarily the answer. The right approach identifies

Annotation companies’ goals are not necessarily aligned with yours…

We have industry experience in MAANG and our team has experience working with annotation companies like Scale, SuperAnnotate, etc. For most labeling (annotation) companies, the business model is:

Let companies generate their own labeling (annotation) spec with perhaps some back & forth depending on the complexity of the labels.
Most annotation companies have different tiers of annotators, the largest pool being non-experts who label everything and the smallest being experts in the field (e.g. doctors). An annotation company then marshals a pool of human labelers, typically starting with the cheapest ones to do a low-quality first pass.
The annotators then label according to the company’s complex annotation spec as best they can, charging per annotation.
Provide feedback and updates to the annotations, possibly updating the annotation spec.

There are four main problems with this process:

1. annotations are not consistent and are usually not assigned to the right labelers,
2. the labeling is time-consuming & expensive,
3. the feedback loop for correcting annotations is erroneous, and
4. annotation specs change over time as model performance changes.

Addressing 1., labelers are not guaranteed to be suited to their assigned labeling task and often label differently than their peers. For instance, for a healthcare company, if the task is “Pick the clinical response that best diagnoses the patient”, these labelers may not even be doctors suited to the task! Additionally, for an autonomous driving company, if the task is to “Draw bounding boxes for stop signs”, does this include the pole or not? What if it’s the back side of a stop sign? Different annotators will label differently without consulting each other.

Addressing 2., charging per annotation sounds great in theory, since the conventional dogma is that more labels help. But it only pays off if the company can afford enough labels to boost model performance, and that number is typically unknown. These annotations will also typically contain errors, which forces AI companies to build internal systems to review them; that takes both time (on the order of months) and more money.

Addressing 3., the feedback loop is not consistent either. Typically the responsibility for annotation verification is pushed to the AI company, which needs to set up its own internal monitoring system (already time-consuming and costly). When an AI company notices an annotation issue, corrections are not guaranteed to come from the same annotator who created the problematic label, and sometimes annotation companies will relabel the entire problematic example instead of correcting it, which costs more. For instance, an autonomous driving company might want to label instance masks of traffic lights and people. In this dummy example, the first annotator makes a mistake and forgets to label traffic lights not facing the camera. The AI company flags it and sends it off to be re-reviewed, but the way the annotation company fixes this is by sending the image to a new annotator who relabels everything from scratch! The second annotator fixes the original issue but doesn’t label a police officer as “people”, and now a new issue emerges! See Figure 3a and Figure 3b. This loop has a low probability of annotating every object correctly: roughly 61% for 50 labels {7}.

Figure 3a: First pass by the first annotator who missed the traffic lights that are not facing the camera. (Image from Waymo Open Dataset [17])

Figure 3b: Second pass from the second annotator who got all the traffic lights but didn’t realize that the “people” class included police officers! (Image from Waymo Open Dataset [17])

Essentially, with this feedback system the labels an annotation company creates are not guaranteed to converge to the right labels!
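The roughly 61% figure quoted above follows from the independence assumption in footnote {7}. Here is a minimal sketch of that calculation; the per-label error rate p = 0.01 and the 50 labels are the footnote’s numbers, while the other label counts in the loop are added only to show the trend, and `prob_all_correct` is a hypothetical helper.

```python
def prob_all_correct(p_error: float, num_labels: int) -> float:
    """Probability that every one of `num_labels` independent labels is correct."""
    return (1 - p_error) ** num_labels


# p = 0.01 and 50 labels come from footnote {7}.
p_success = prob_all_correct(p_error=0.01, num_labels=50)
print(f"50 labels: {p_success:.3f}")

# Extra label counts, for illustration only.
for n in (10, 100, 200):
    print(f"{n} labels: {prob_all_correct(0.01, n):.3f}")
```

Because the success probability decays exponentially in the number of labels, relabeling a whole example from scratch on every correction keeps resetting the odds rather than improving them.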

The incentives of AI companies are not well aligned with those of labeling companies. AI companies want to improve their AI model and their product while annotation companies want to label as much company data as possible so that they can charge for it. You want to make your model performant and so should annotation companies.

Addressing 4., in industry (and research), when trying to solve a problem, there are many possible solutions. Perhaps pretraining on the entire internet will improve your LLM, or perhaps grounding an LLM by training on labeled text-image pairs will help with LLM reasoning, or perhaps adding chain of thought will help. In other words, when designing AI systems we need to try a lot of different things in parallel, since it is sometimes unclear what the best approach will be. Labeling is one solution, which means that as we better understand our problem the label definition is subject to change.

For instance, take labeling stop signs in autonomous driving; suppose that we first label stop signs. We notice that performance improves when we know if a stop sign is partially obstructed, so we later update the annotation spec to add a metadata tag called “obstructed” for signs that are partially or fully hidden. We then go back to the annotation company and ask them to relabel all our stop signs with this tag! This “annotation-platform in the loop” means that every model experiment that updates the labeled dataset is super expensive!

So, one may wonder, why are labeling providers used at all? For two reasons. First, high-quality labels on data do help, as discussed earlier. In fact, less data with higher-quality labels can outperform some of these large pretrained models, SAM being an excellent example. Second, the alternative to using an annotation company is to create an internal annotation platform, which is even more expensive and time-consuming, since producing the same volume of labels as the other players can take years!

Conclusion

The optimal data flywheel represents data in a form that’s inherently insightful and interactable: we should be able to detect anomalies and also chat with our data to garner interesting patterns and insights. This flywheel should enhance annotation platforms by focusing on what should be labeled instead of labeling everything {8}. And finally, this data flywheel should align with model performance, tying directly to whatever problem your AI company is solving.

The traditional dogma is that more data “just works” and sometimes deep learning feels like alchemy. Perhaps more data will work for you in the short run but when things “just don’t work” the proper way is to assess failure both in the data & the model and work from there.

Over at Interpret we hope to change the paradigm. If you are interested, reach out to us at ily@interpretai.tech

Footnotes

{1} Back when AlexNet was still a thing, circa 2015ish, most models for computer vision were trained on a subset of very particular problem types: classification, segmentation, object detection (i.e. foundation problems) and others like image captioning, scene recognition, pose estimation (see appendix for more details) [1]. Note this was pre “Attention is all you need”, when bigrams were à la mode. The focus then was model development while benchmarks remained fixed. These benchmarks were “largish” labeled datasets (order of 10k to 1M) that were used to evaluate model performance. Some of the popular CV benchmarks you’re probably familiar with are MNIST, ImageNet, MS COCO, KITTI, Caltech-101 [2]. If you look at the largest labeled datasets around this time, they had around 1M labels, and that was considered large at the time.
{2} Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift from learning features for only a specific labeled dataset to learning general features on unlabeled data that correlated well with other problems like classification, segmentation, object detection. These datasets were massive compared to their labeled brethren [5]. At the same time, advancements in model training (CUDA optimization, which is why NVIDIA hit a 4T market cap), deep learning libraries (TensorFlow, PyTorch), and new / improved model architectures like Transformers from “Attention Is All You Need” opened up a brand new world. Researchers also noticed that increasing the size of models typically correlated with improved performance on unseen data (from the same data distribution). All of this combined with modern pretraining algorithms like pretext tasks, contrastive learning, masked language modeling, masked autoencoding (MAE), and multimodal modeling [4], unlocking the era of training big models on even more massive unlabeled datasets. Ergo, models like CLIP [13], DALL-E [14], DINOv2 [15], BERT [16].
{3} “Alignment” is an overused term; I mean alignment in both the “we want our LLM to be helpful not harmful” sense and the “data distribution alignment” sense.
{4} When training / fine-tuning a model, scaling model size correlates with improvement in performance, roughly following a power law. In industry, we’re already hitting the peak for model size scaling laws and fine-tuning is giving less and less of an advantage. The next frontier is improving pretraining methods to better utilize existing unlabeled datasets.
{5} In the SAM paper, annotations could take 30 seconds (but suppose it took 4 seconds based on the improvements from SAM v2 [10]); reviewing 1.1B masks would have required 1,100,000,000 * 4 seconds = 4.4 billion seconds, or ~51,000 days of annotation time!
{6} This is also assuming that the data distribution is stationary (unchanging). If we wanted to extend the labels to a different data distribution (say deep sea diving videos, where the semantics & dynamics of objects are different), then fine-tuning SAM would still require the same data flywheel training process, which also means more time and more money.
{7} Suppose that each object has a probability of being mislabeled of p = 0.01 (i.e. an annotator labels incorrectly or misses a label once every 100 labels). Assuming 50 objects in a video, the probability of succeeding assuming independence is (1 - p)^50 ≈ 61% chance of success! And that’s conservative.
{8} Fundamentally, when AI companies have better clarity on what to label, their incentives align with annotation companies.
{9} It is increasingly clear that a few samples (e.g. thousands) of very high-quality data are far better than millions of low-quality samples. This is particularly true in post-training of LLMs in industry, but it is also starting to be the focus of pre-training.
{10} A data flywheel is the loop used to collect data, improve the model, which makes a better product, which then modifies what data to collect, and the cycle repeats (for example this image from dataloop.ai https://dataloop.ai/book/the-data-flywheel-effect/). A data engine is the infra for collecting/labeling/evaluating data (for example Scale’s product https://scale.com/data-engine).

Special Thanks

Cameron Tukerman-Lee (also credit for the title)
Gabriele Sorrento
Francesco Pongetti
Lotfi Herzi

Appendix

[1] A more extensive list of popular 2015 foundational problems across different domains (so, sort of pre-multimodal).
- Computer vision
  - classification
  - segmentation
  - object detection
  - image captioning
  - scene recognition
  - pose estimation
  - Optical Flow Estimation
  - Depth Estimation
  - Face recognition
  - Visual tracking
  - Style transfer
  - Image generation
- Natural Language Processing
  - Machine translation
  - Part of speech tagging
  - Question answering
- Speech Processing
  - Speech recognition
  - Speaker identification
  - Emotion classification
- Time series
- Reinforcement Learning
[2] Popular datasets separated by domain around 2015
- Classification
  - ImageNet (ILSVRC 2017) - 1.2M training, 1000 classes - https://www.image-net.org/challenges/LSVRC/2017/index.php
  - CIFAR-10/100 - 60K (32x32), 10/100 classes - https://www.cs.toronto.edu/~kriz/cifar.html
  - MNIST - 70K handwritten digits - https://www.kaggle.com/datasets/hojjatk/mnist-dataset
  - Fashion-MNIST - 70K fashion items - https://github.com/zalandoresearch/fashion-mnist
  - SVHN - 600K real world house numbers, 10 classes (one per digit) - http://ufldl.stanford.edu/housenumbers/
  - Caltech-101/256 - 9K/30K images, 101/256 categories - https://data.caltech.edu/records/mzrjq-6wc02, https://data.caltech.edu/records/nyy15-4j048
  - Oxford Flowers 102 - 102 categories - https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
  - Oxford-IIIT Pets - 7.4K images, 37 pet breeds - https://www.robots.ox.ac.uk/~vgg/data/pets/
  - Stanford Cars - 16K images, 196 car models - https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset
  - FGVC Aircraft - 10.2K images, 100 aircraft variants - https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
  - Food-101 - 101 food categories - https://www.kaggle.com/datasets/dansbecker/food-101
  - CUB-200-2011 - 12K bird images, 200 species - https://www.vision.caltech.edu/datasets/cub_200_2011/
  - Stanford Dogs - 20K images, 120 dog breeds - http://vision.stanford.edu/aditya86/ImageNetDogs/
  - MIT Indoor Scenes - 15K images, 67 indoor categories - http://web.mit.edu/torralba/www/indoor.html
- Segmentation
  - PASCAL VOC 2012 - 11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
  - MS COCO - 328K images, 80 object classes, 91 stuff categories, 5 captions per image, 250k people with keypoints - https://cocodataset.org/
  - Cityscapes - 5K fine/25K coarse annotations, 8 classes - https://www.cityscapes-dataset.com/, https://www.cityscapes-dataset.com/dataset-overview/#class-definitions
  - ADE20K - 25K images, 150 classes - https://groups.csail.mit.edu/vision/datasets/ADE20K/
  - PASCAL Context - 10K images, 459 classes - https://cs.stanford.edu/~roozbeh/pascal-context/
  - SBD (Semantic Boundaries) - 11K images from PASCAL - https://paperswithcode.com/dataset/sbd
  - NYUDv2 - 1.4K RGB-D images - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
  - SUN RGB-D - 10K RGB-D images - https://rgbd.cs.princeton.edu/
  - KITTI Semantic - http://www.cvlibs.net/datasets/kitti/
- Object Detection
  - PASCAL VOC 2012 - 10K/11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/
  - MS COCO - 328K images, 80 classes, 1.5M instances - https://cocodataset.org/
  - KITTI Object - http://www.cvlibs.net/datasets/kitti/
  - Open Images (v1 in 2016) - 15.8 images, 6000 classes - https://storage.googleapis.com/openimages/web/index.html
  - WIDER Face - 32K images, 393K face annotations - http://shuoyang1213.me/WIDERFACE/
- Other Tasks
  - Depth Estimation
    - NYUDv2 - 1.4K RGB-D scenes - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
    - KITTI Depth - http://www.cvlibs.net/datasets/kitti/
    - Make3D - 534 images with depths - http://make3d.cs.cornell.edu/data.html
  - Optical Flow
    - Sintel - http://sintel.is.tue.mpg.de/
    - KITTI Flow - http://www.cvlibs.net/datasets/kitti/
    - Flying Chairs - 22K synthetic pairs - https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html
    - Middlebury - Small but precise benchmark - https://vision.middlebury.edu/flow/
  - Pose Estimation
    - MPII Human Pose - 25K images, 40K people - http://human-pose.mpi-inf.mpg.de/
    - FLIC - 5003 images from movies - https://bensapp.github.io/flic-dataset.html
    - Leeds Sports Pose - https://www.kaggle.com/datasets/dkrivosic/leeds-sports-pose-lsp
  - Face Recognition
    - LFW (Labeled Faces in the Wild) - 13K images, 5.7K people - https://www.kaggle.com/datasets/jessicali9530/lfw-dataset
    - CelebA - 200K images, 10K identities - http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
    - MegaFace - 1M images, 690K identities - http://megaface.cs.washington.edu/
    - VGGFace - 2.6K people - https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
  - Video/Action Recognition
    - UCF-101 - 13,320 videos, 101 actions - https://www.crcv.ucf.edu/data/UCF101.php
    - HMDB-51 - 6800 videos, 51 actions - https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
    - Sports-1M - 1M YouTube videos, 487 sports - https://cs.stanford.edu/people/karpathy/deepvideo/
    - ActivityNet - 20K videos, 200 classes - http://activity-net.org/
  - Attributes/Multi-label
    - WIDER Attribute - http://mmlab.ie.cuhk.edu.hk/projects/WIDERAttribute.html
    - Berkeley Attributes - https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/poselets/
- Reinforcement Learning (can think of dataset size as number of rollouts)
  - Classic control tasks
    - OpenAI Gym (cartpole, mountaincar, acrobot, etc). I remember this before chatgpt lol maybe I’m old
    - MuJoCo (Multi-joint dynamics with contact) like the halfcheetah, hopper, humanoid, etc. This was typically done in a physics simulation and was popular for PPO.
  - Board games
    - Go
    - Chess
    - PyGame
  - TORCS
  - Minecraft
  - ViZDoom
  - Atari 2600 from DeepMind
[3] Scaling Laws Paper, Larger pretrained models paper
- "Scaling Laws for Neural Language Models" by Jared Kaplan et al. (2020): https://arxiv.org/abs/2001.08361
- "Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level": https://arxiv.org/abs/2105.06020
[4] Modern pretraining algorithms
- Pretext Tasks
  - Rotation prediction
  - Jigsaw puzzles
  - Colorization
  - Inpainting/Masked patches
- Contrastive Learning Methods
  - SimCLR (Chen et al., 2020): "A Simple Framework for Contrastive Learning of Visual Representations" [arXiv:2002.05709]
  - MoCo v1 & v2 (He et al., 2019/2020): "Momentum Contrast for Unsupervised Visual Representation Learning"; "Improved Baselines with Momentum Contrastive Learning" [arXiv:2003.04297]
  - BYOL (Grill et al., 2020): "Bootstrap Your Own Latent"
  - PIRL (Misra & van der Maaten, 2020): "Self-Supervised Learning of Pretext-Invariant Representations"
- Masked Modeling
  - Masked Language Modeling (MLM): BERT (Devlin et al., 2018)
  - Masked Autoencoder (MAE)
- Multimodal Learning
  - CLIP (Radford et al., 2021): "Learning Transferable Visual Models From Natural Language Supervision" [arXiv:2103.00020]
  - ALIGN (Jia et al., 2021)
  - DALL-E (Ramesh et al., 2021): "Zero-Shot Text-to-Image Generation"
[5] Pretraining datasets
- JFT-300M: Google’s internal dataset of 300M pseudo-labeled images: https://ar5iv.labs.arxiv.org/html/1707.02968
- LAION-5B: 5.85 billion (image, text) pairs scraped from Common Crawl
- CLIP Training Data: 400M (image, text) pairs https://arxiv.org/abs/2103.00020 (not released)
- Wikipedia: English 20GB
- Kinetics-700: 650k videos (technically has action classes but still used)
[6] Improving Language Understanding by Generative Pre-Training https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[7] Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
[8] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism https://arxiv.org/abs/2401.02954
[9] Constitutional AI: Harmlessness from AI Feedback https://arxiv.org/abs/2212.08073
[10] Segment anything: https://arxiv.org/abs/2304.02643, SAM 2: Segment Anything In Images & Videos https://arxiv.org/pdf/2408.00714. More details below.
[11] https://techcrunch.com/2025/06/13/new-details-emerge-on-metas-14-3b-deal-for-scale/
[12] https://www.nature.com/articles/s41586-025-09227-0
[13] "Learning Transferable Visual Models From Natural Language Supervision" https://arxiv.org/abs/2103.00020
[14] "Zero-Shot Text-to-Image Generation" https://arxiv.org/abs/2102.12092
[15] "Emerging Properties in Self-Supervised Vision Transformers" https://arxiv.org/abs/2104.14294, "DINOv2: Learning Robust Visual Features without Supervision" https://arxiv.org/abs/2304.07193
[16] “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” https://arxiv.org/abs/1810.04805
[17] Waymo E2E Open dataset https://waymo.com/open/data/e2e#camera-data

Data scale is NOT all you need