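TL;DR

The dogma among AI companies is that more data leads to better performance, but in practice data scale isn't everything. A smaller, high-quality dataset can outperform a larger, low-quality one. Generating high-quality data requires filtering noise, understanding your unlabeled data, and figuring out what to label. Labeling data at scale through annotation platforms is also problematic: their incentives are often misaligned with yours, and the platforms are a time-consuming, error-prone, and expensive bottleneck. The best way to improve an AI system is to understand the data being fed to the model, by intelligently representing the dataset in an interactable form using self-supervised representation learning, foundation modeling, and filtering. These practices guard against the risk of AI systems underperforming and producing harmful outputs.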

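Less is more

Data scale isn't everything. Blindly increasing dataset size while pretraining a model exposes AI-first companies to the risk of serious errors. Training a model on a large dataset with an unknown distribution results in unexpected behavior: in robotics it can lead to incorrect and dangerous trajectories, for a healthcare company to inaccurate risk assessments, and for LLMs to generating harmful speech {9}. Grok made this mistake on X, generating harmful speech in the now-deleted posts shown in Figure 0a. Even the xAI CEO acknowledged the need to be more selective about training data rather than training on the entire internet. So how do we properly select data to train and evaluate these models? What tools exist?

The solution is to intelligently represent data in a form that is interactable and semantically diverse enough. This approach helps with: 1. creating training and evaluation datasets for both pretraining and post-training, 2. identifying gaps in the data, and 3. providing recommendations on how to fill those gaps (by buying or collecting data).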

Figure 0a: Examples of an LLM generating harmful speech, likely due to the existence of similar text in the training data the xAI team used to train Grok.

Figure 0b: Reaction from the xAI CEO after Grok generated harmful speech. The interesting piece is the team's focus on being selective about the training data. Original post from the xAI CEO: https://x.com/elonmusk/status/1944132781745090819

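Data flywheels {10} & annotation companies

Across the industry, most AI company CEOs, AI researchers, and engineers are dissatisfied with the modern annotation companies that integrate into their data flywheels.

The common solution at AI companies today is to collect a massive unlabeled dataset for pretraining (or use an open-source pretrained model), label another large dataset specific to the intended task, and then hand-curate training and evaluation sets. Labeling is typically outsourced to annotation companies (ScaleAI, SuperAnnotate, Labelbox, etc.) that integrate into the data engine. But labeling everything in a large dataset doesn't work well: scaling data labeling to millions or billions of examples is error-prone, unsustainably expensive, and slow, which leaves AI companies dissatisfied. More importantly, the labeling loop never ends. The data flywheel continually adapts to evolving models and newly collected data, so labeling requirements are fluid and change over time. Annotation companies can't keep up with the pace of change: model updates can happen within weeks, while labeling can take months.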

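The modern labeling loop in a data engine looks like this:

1. Collect some data.
2. Design or update a labeling spec.
3. Send the data and the spec to a labeling company (Scale, SuperAnnotate, etc.) and pay for the labeling.
4. Iterate with the labeling company and train the model.
5. Observe the results, then repeat steps 2-5 indefinitely.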

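For example, a self-driving company may want to label stop signs. After labeling one million stop signs and looking at the results, it realizes it also wants to label each stop sign's "visibility", and then realizes it wants to label the trees that may surround a stop sign and add an "occluded" label. Now all the data (which has grown in the meantime, since data collection is continuous) has to be relabeled! As long as the company keeps improving its model, this cycle never ends!

Meta spending $14.3 billion on a 49% stake to hire the Scale.AI CEO [11] may be one of the riskiest moves the company has made so far, precisely because of these difficulties with labeling companies.

So if blindly training on gigantic datasets is problematic and labeling everything is hard, what should we do? After working on this problem for the past four years, the best solution we've found is to represent the data well enough that it becomes easy to select and understand what is in the data and how that data affects the model. We should be able to converse with our data in a way that lets us quickly retrieve examples and quickly build evaluation sets to test our models.

That's exactly what we're building at Interpret AI. We're building a data introspection platform, a data curation platform, and an intelligent data marketplace that let companies building AI systems interact with and understand their datasets. We imagine a world where you can converse with your data using natural language, audio, images, and video to retrieve similar instances, so that companies can trust and know the data (or the gaps in the data) that powers their models. (If any of this resonates with you, reach out to us at ily@interpretai.tech.)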

First, scale what will help

The traditional data flywheel

Figure 1a: The traditional data engine powering AI solutions in companies.

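1. The company has infrastructure that continuously collects data into a dataset (1b). The team then creates heuristic subsets of the data that it hopes will improve the model once labeled (1a).
2. The data is sent to a labeling (annotation) company. The labeling company produces labels (annotations), which the team reviews; converging can take months of back-and-forth.
3. A base AI model is pretrained.
4. The pretrained model is fine-tuned using the labels from the labeling company.
5. The final model is evaluated with the company's evaluation system to produce metrics.
6. The company can then use this feedback to select a different data subset, update the labeling requirements, or make model changes. By this point, the dataset subset is already going stale.

Note: metrics can be skewed by poor annotations, which demands continual iteration from the team; this is costly and not time-efficient (6).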

Figure 1b: The time requirements and setup of a traditional company's AI system, with a rough timeline for iterating on each part independently. Notice that the major bottleneck is getting labels from a labeling company: with a labeling company in the loop, it takes months of iteration to produce labels that properly improve the AI model. See Figure 1a for how each of these parts interacts in a traditional company.

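Interpret AI's data flywheel:

Start informed, with deep data insights

Figure 2a: Interpret AI's data flywheel and how we provide immediate data insights.

Immediate data subset recommendations and enhanced data suggestions for pretraining and training (1a and 1b, respectively).
The team now reviews the much smaller data subsets suggested by Interpret before sending them to a labeling company. These subsets are fluid and continually update as the data changes. (Optionally, if the company integrates its baseline model, Interpret AI can provide more insight into how the data affects model performance.)
Round trips with the labeling company are accelerated from months to weeks and are far cheaper, because the annotation spec and the dataset selection are clear.

Feedback is focused on the model (6).
Finally, Interpret AI analyzes the data space to provide insights on what data to collect or buy to accelerate model improvement.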

Figure 2b: A breakdown of the time requirements for different processes when using Interpret's platform. On the left-hand side, the feedback iteration speed (in green) is accelerated, and the labeling bottleneck is gone. The figure shows how Interpret AI integrates directly with customers to accelerate model training, data triage and understanding, and evaluation. Interpret AI provides solutions for:

Understanding the existing data distribution.
Identifying model gaps that correlate with data gaps.
Buying and curating data to fill data gaps.

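Use cases

We work with multiple businesses across the robotics, healthcare, and agentic LLM industries. If any of this resonates with you, reach out to us at ily@interpretai.tech.

Healthcare

HealthCo wants to predict patients' risk of cardiovascular disease.

For training

Interpret AI analyzes the cardiovascular data using our Interpret foundation model, processing EHRs, images, and, where available, ECG data [12].
Interpret AI finds anomalies or "gaps" in HealthCo's data and describes the demographics of these people (e.g., female, middle-aged, no children, historically prescribed trimetazidine).
The flagged records are further analyzed by experts. The selected data can be updated, ignored, used to purchase more data on people historically prescribed trimetazidine, or sent to a labeling company to annotate this specific group.
The selected data is used to train the AI cardiovascular disease model. If HealthCo integrates its cardiovascular model into the Interpret platform, Interpret further analyzes where the model underperforms in real time, enabling immediate introspection.
This process cuts the model training timeline from months to weeks, improving the AI system quickly and reducing cost!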

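For safety

Suppose HealthCo has an example of a person who suffered a heart attack and wants to analyze the EHRs of other people who are similar to this person and may be at risk.

Using Interpret AI, HealthCo can select this person's example, retrieve a pool of related people, and sort them by confidence.
These people can then be flagged as at risk, quickly identifying hundreds of at-risk people out of millions of records!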
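To make the retrieval step concrete, here is a minimal sketch of "select one example, then rank everyone else by similarity". It is not Interpret AI's implementation: it assumes each record has already been turned into an embedding vector by some model, and the names `cosine_similarity`, `rank_similar`, and the toy `records` are hypothetical. Cosine similarity stands in for whatever confidence score a real system would use.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, records, top_k=3):
    """Return the top_k (record_id, score) pairs, most similar first."""
    scored = [(record_id, cosine_similarity(query, emb))
              for record_id, emb in records.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones would come from a learned model.
records = {
    "patient_a": [0.9, 0.1, 0.0],
    "patient_b": [0.0, 1.0, 0.0],
    "patient_c": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # embedding of the selected heart-attack example

ranked = rank_similar(query, records, top_k=2)
for record_id, score in ranked:
    print(f"{record_id}: {score:.3f}")
```

At millions of records, a brute-force scan like this would normally be replaced by an approximate nearest-neighbor index, but the ranking idea is the same.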

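Robotics

DriveCo is building autonomous race cars as toys for kids to play with outdoors.

For training

Interpret AI analyzes runs of the collected race car video data and provides a data report.
Interpret AI finds that most of the video runs are not geographically diverse and that there are very few examples of the race cars driving outdoors in backyards.
Interpret AI recommends that the DriveCo team collect more examples of outdoor video. Interpret AI also tries to mitigate this imbalance by balancing the dataset in a learned way using the Interpret AI foundation model.
- Without Interpret AI, DriveCo might have sent 1000+ hours of race car data out to label objects it didn't need! Now it only needs to label 10 hours!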
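As a rough illustration of what "balancing the dataset" can mean, the sketch below uses plain inverse-frequency reweighting. This is a simple stand-in technique, not the learned balancing described above, and it assumes each clip already carries a scene tag. The function name `inverse_frequency_weights` and the toy `scene_labels` are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(group_labels):
    """Per-sample weights so that every group contributes the same total weight.

    A sample in group g gets weight total / (n_groups * count[g]).
    """
    counts = Counter(group_labels)
    n_groups = len(counts)
    total = len(group_labels)
    return [total / (n_groups * counts[g]) for g in group_labels]


# Hypothetical scene tags for 10 video clips: backyard footage is rare.
scene_labels = ["indoor"] * 8 + ["backyard"] * 2
weights = inverse_frequency_weights(scene_labels)

for scene in sorted(set(scene_labels)):
    group_total = sum(w for w, s in zip(weights, scene_labels) if s == scene)
    print(scene, group_total)
```

Weights like these can be passed to a weighted sampler or a weighted loss. Reweighting only redistributes emphasis, though; it does not replace collecting more of the rare data.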

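For safety

Suppose these autonomous race cars come under scrutiny for infant safety.

DriveCo can search its database for videos containing "babies" to check whether it has this data.
If DriveCo doesn't have the data, the team is notified to collect it (hopefully using fake babies); otherwise, DriveCo can show consumers and investors that the product is in fact safe around babies!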

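How we got here

A brief history of labels and pretraining

In 2015, before Transformers, most models were trained to solve a very specific subset of problems: classification, segmentation, object detection (i.e., foundational problems), and so on [1]. Benchmarks were "largish" labeled datasets on the order of 10k to 1M examples. {1}

Modern pretraining emerged around 2017 and changed the game. Borrowing from representation learning, pretraining arrived as a fundamental paradigm shift in which unlabeled datasets suddenly yielded huge gains in model performance. The unlabeled datasets used for pretraining were massive compared to their labeled brethren [5]. Combined with other techniques and advances {2}, this led to modern foundation models such as CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].

OpenAI then changed the game again by building on advances in transformers, pretraining, and reinforcement learning to release GPT (Generative Pre-trained Transformer) [6]. Sora [7], DeepSeek [8], and Anthropic [9] all use pretraining on massive datasets as the backbone of their high-performing models. But hidden in there is a keen observation that most people don't talk about.

Pretraining is a good first step, but most of these models need additional training on top of the pretrained base. Whether through RL or supervised fine-tuning, the best-performing models are somehow aligned {3} to the original problem. Fine-tuning also only scales so far, however, so improving pretraining is essential for future model performance {4}.

One of the most compelling examples in the literature of how to properly incorporate pretraining and build a data flywheel is the labeled data flywheel Meta built for the Segment Anything Model (SAM) and SAM 2 [10]. Yet even in this example, data labeling is extremely hard to scale.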

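Segment Anything: innovations and the message

TL;DR: What SAM shows us is that quality assurance and understanding what is in your data are hard but important problems to solve. Adding more data is not necessarily the answer.

SAM built a data flywheel that used partially trained versions of SAM at various stages of training, together with human label feedback, to curate a massive labeled dataset. Their approach shows a proper way to integrate labeling into a pipeline, but it also highlights that even a well-built data labeling flywheel is expensive and hard to scale. At some point the dataset becomes large enough that humans can't annotate everything, so other introspection methods (i.e., what Interpret is building) are needed.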

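Roughly, SAM's approach is as follows [10]:

1. Start with an MAE-pretrained hierarchical ViT.
2. Train SAM on publicly available segmentation datasets.
3. Use the partially trained SAM to generate segmentation masks on a subset of the data.
4. Have humans refine the segmentation predictions. Then use the masks to train an object detector to find more objects, and have humans label those.
5. Repeat steps 3-4, gradually increasing the size of the dataset.
6. Run on 11 million images to obtain SA-1B, with 1.1 billion masks. Use a QA team to flag potentially bad examples. Note that providing human labels for all 1.1 billion masks is extremely hard.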

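The idea behind SAM 2, the video segmentation model, is the same; it produced the SA-V dataset with 35.5M masks across 50.9K videos, 53x more masks than any existing video segmentation dataset [10].

Notice that the best segmentation model was trained on data directly related to the task, with label feedback tightly coupled into a fast and efficient data flywheel. Pretraining and training on a collection of open-source segmentation datasets were only the first and second steps.

Also notice that human labeling eventually hit a limit. When the data flywheel scaled to 11 million images and 1.1 billion masks, Meta still had to run QA filters to flag bad examples. By our estimate from the paper's per-mask annotation times, manually annotating all 1.1 billion masks would have taken about 51,000 days of annotation time! {5}

This is Meta we're talking about; for most companies, hiring for that would be outrageously expensive and infeasible! {6} Labeling at this scale is just hard!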
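The 51,000-day figure is simple arithmetic, sketched below. It uses the two per-mask times mentioned in footnote 5 (an optimistic 4 seconds and roughly 30 seconds) and assumes a "day" means 24 hours of continuous annotation by a single annotator; the helper name `annotation_days` is ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def annotation_days(num_masks, seconds_per_mask):
    """Total annotation time in 24-hour days for one annotator working non-stop."""
    return num_masks * seconds_per_mask / SECONDS_PER_DAY


NUM_MASKS = 1_100_000_000  # masks in SA-1B

days_at_4s = annotation_days(NUM_MASKS, 4)    # optimistic per-mask time
days_at_30s = annotation_days(NUM_MASKS, 30)  # slower per-mask time

print(f"4 s/mask:  {days_at_4s:,.0f} days")
print(f"30 s/mask: {days_at_30s:,.0f} days")
```

Even the optimistic case works out to well over a century of non-stop work for a single annotator, which is why the work has to be spread across a large labeling workforce or avoided by labeling less.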

TL;DR

Annotation companies’ goals are not necessarily aligned with yours…

We have industry experience at MAANG companies, and our team has worked with annotation companies like Scale, SuperAnnotate, etc. For most labeling (annotation) companies, the business model is:

1. Let companies write their own labeling (annotation) spec, with perhaps some back-and-forth depending on the complexity of the labels.
2. Most annotation companies have different tiers of annotators, the largest pool being non-experts who label everything and the smallest being experts in the field (e.g., doctors). An annotation company then marshals a pool of human labelers, typically starting with the cheapest ones to do a low-quality first pass.
3. The annotators label according to the company’s complex annotation spec as best they can, and the annotation company charges per annotation.
4. The AI company provides feedback and corrections on the annotations, possibly updating the annotation spec.

There are four main problems with this process:

1. annotations are not consistent and are usually not assigned to the right labelers,
2. the labeling is time-consuming and expensive,
3. the feedback loop for correcting annotations is error-prone, and
4. annotation specs change over time as model performance changes.

Addressing 1., labelers are not guaranteed to be suited to their assigned labeling task and often label differently from their peers. For instance, for a healthcare company, if the task is “Pick the clinical response that best diagnoses the patient”, the labelers may not even be doctors suited to the task! Additionally, for an autonomous driving company, if the task is to “Draw bounding boxes for stop signs”, does this include the pole or not? What if it’s the back side of a stop sign? Different annotators will label differently without consulting each other.

Addressing 2., charging per annotation sounds great in theory, since the conventional dogma is that more labels help. But it only helps if the company can afford enough labels to boost model performance, and that number is typically unknown. These annotations will also typically contain errors, which forces AI companies to build internal systems to review them, costing both time (on the order of months) and more money.

Addressing 3., the feedback loop is not consistent either. Typically the responsibility for annotation verification is pushed to the AI company, which needs to set up its own internal monitoring system (already time-consuming and costly). When an AI company notices an annotation issue, corrections are not guaranteed to come from the same annotator who created the problematic label, and sometimes annotation companies will relabel the entire problematic example instead of correcting it, which costs more. For instance, an autonomous driving company might want to label instance masks of traffic lights and people. In this dummy example, the first annotator makes a mistake and forgets to label traffic lights not facing the camera. The AI company flags it and sends it off to be re-reviewed, but the annotation company fixes this by sending the image to a new annotator who relabels everything from scratch! The second annotator fixes the original issue but doesn’t label a police officer as “people”, and now a new issue emerges! See Figure 3a and Figure 3b. Each pass through this loop has a low probability of annotating every object correctly: roughly 61% for 50 labels {7}.

Figure 3a: First pass by the first annotator who missed the traffic lights that are not facing the camera. (Image from Waymo Open Dataset [17])

Figure 3b: Second pass from the second annotator who got all the traffic lights but didn’t realize that the “people” class included police officers! (Image from Waymo Open Dataset [17])

Essentially, with this feedback system the labels an annotation company creates are not guaranteed to converge to the right labels!
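The 61% figure from footnote 7 can be reproduced with a few lines. The sketch assumes, as the footnote does, that each object is mislabeled independently with probability p = 0.01; the function name `prob_all_correct` is ours.

```python
def prob_all_correct(p_error, num_objects):
    """Probability that every one of num_objects labels is correct,
    assuming independent errors with per-object error rate p_error."""
    return (1.0 - p_error) ** num_objects


p50 = prob_all_correct(0.01, 50)    # the 50-object case from footnote 7
p100 = prob_all_correct(0.01, 100)  # the same error rate on a busier scene

print(f"50 objects:  {p50:.1%}")
print(f"100 objects: {p100:.1%}")
```

Because a full relabel redraws every label, each relabeling pass is a fresh attempt with the same odds, which is why the loop does not reliably converge.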

The incentives of AI companies are not well aligned with those of labeling companies. AI companies want to improve their AI model and their product, while annotation companies want to label as much company data as possible so that they can charge for it. You want to make your model performant, and annotation companies should want that too.

Addressing 4., in industry (and research), there are many possible solutions to any given problem. Perhaps pretraining on the entire internet will improve your LLM, or perhaps grounding an LLM by training on labeled text-image pairs will help with LLM reasoning, or perhaps adding chain of thought will help. In other words, when designing AI systems we need to try a lot of different things in parallel, since it’s often unclear what the best approach will be. Labeling is one solution, which means that as we better understand our problem, the label definition is subject to change.

For instance, take labeling stop signs in autonomous driving; suppose that we first label stop signs. We notice that performance improves when we know whether a stop sign is partially obstructed, so we later update the annotation spec to add a metadata tag called “obstructed” for when the sign is partially visible or not visible at all. We then go back to an annotation company and ask them to relabel all our stop signs with this tag! This “annotation platform in the loop” means that every model experiment that updates the labeled dataset is super expensive!

So, one may wonder, why are labeling providers used at all? For two reasons. First, high-quality labels on data do help, as discussed earlier. In fact, less data with higher-quality labels can outperform some of these large pretrained models, SAM being an excellent example. Second, the alternative to using an annotation company is to create an internal annotation platform, which is even more expensive and time-consuming, since producing the same volume of labels as the other players can take years!

Conclusion

The optimal data flywheel represents data in a form that’s inherently insightful and interactable: we should be able to detect anomalies and also chat with our data to garner interesting patterns and insights. This flywheel should enhance annotation platforms by focusing on what should be labeled instead of labeling everything {8}. And finally, this data flywheel should align with model performance, tying directly to whatever problem your AI company is solving.

The traditional dogma is that more data “just works” and sometimes deep learning feels like alchemy. Perhaps more data will work for you in the short run but when things “just don’t work” the proper way is to assess failure both in the data & the model and work from there.

Over at Interpret we hope to change the paradigm. If you are interested, reach out to us at ily@interpretai.tech.

Footnotes

{1} Back when AlexNet was still a thing, circa 2015, most computer vision models were trained on a small set of very particular problem types: classification, segmentation, object detection (i.e., foundational problems), and others like image captioning, scene recognition, and pose estimation (see the appendix for more details) [1]. Note this was before “Attention Is All You Need”, when bigrams were à la mode. The focus then was model development while benchmarks remained fixed. These benchmarks were “largish” labeled datasets (on the order of 10k to 1M examples) that were used to evaluate model performance. Some of the popular CV benchmarks you’re probably familiar with are MNIST, ImageNet, MS COCO, KITTI, and Caltech-101 [2]. The largest labeled datasets around this time had around 1M labels, and that was considered large.
{2} Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining came as a fundamental paradigm shift from learning features for only a specific labeled dataset to learning general features on unlabeled data that transferred well to other problems like classification, segmentation, and object detection. These datasets were massive compared to their labeled brethren [5]. At the same time, advancements in model training (CUDA optimization, which is why NVIDIA hit a 4T market cap), deep learning libraries (TensorFlow, PyTorch), and new or improved model architectures like Transformers from “Attention Is All You Need” opened up a brand new world. Researchers also noticed that increasing the size of models typically correlated with improved performance on unseen data (from the same data distribution). All of this combined with modern pretraining algorithms like pretext tasks, contrastive learning, masked language modeling, masked autoencoding (MAE), and multimodal modeling [4], unlocking the era of training big models on ever more massive unlabeled datasets. Ergo, models like CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].
{3} “Alignment” is an overused term. I mean alignment in both the “we want our LLM to be helpful, not harmful” sense and the “data distribution alignment” sense.
{4} When training or fine-tuning a model, scaling model size correlates with improvement in performance, roughly following a power law [3]. In industry, we’re already hitting the peak for model size scaling laws, and fine-tuning is giving less and less of an advantage. The next frontier is improving pretraining methods to better utilize existing unlabeled datasets.
{5} In the SAM paper, an annotation could take around 30 seconds (but suppose it took 4 seconds, based on the improvements from SAM 2 [10]); reviewing 1.1B masks would have required 1,100,000,000 * 4 seconds = 4.4 billion seconds, or ~51,000 days of annotation time!
{6} This also assumes that the data distribution is stationary (unchanging). If we wanted to extend the labels to a different data distribution (say, deep sea diving videos, where the semantics and dynamics of objects are different), then fine-tuning SAM would still require the same data flywheel training process, which means more time and more money.
{7} Suppose that each object has a probability p = 0.01 of being mislabeled (i.e., an annotator labels incorrectly or misses a label once every 100 labels). Assuming 50 objects in a video and independent errors, the probability that every object is labeled correctly is (1 - p)^50 = 0.99^50, or about a 61% chance of success! And that’s conservative.
{8} Fundamentally, when AI companies have better clarity on what to label, their incentives align with those of annotation companies.
{9} More and more, it is clear that a few samples (e.g., thousands) of very high-quality data are far better than millions of low-quality samples. This is particularly true in post-training of LLMs in industry, but it is also starting to be the focus of pretraining.
{10} A data flywheel is the loop used to collect data and improve the model, which makes a better product, which then modifies what data to collect, and the cycle repeats (for example, see this image from dataloop.ai: https://dataloop.ai/book/the-data-flywheel-effect/). A data engine is the infrastructure for collecting, labeling, and evaluating data (for example, Scale’s product: https://scale.com/data-engine).

Special Thanks

Cameron Tukerman-Lee (also credit for the title)
Gabriele Sorrento
Francesco Pongetti
Lotfi Herzi

Appendix

[1] A more extensive list of popular 2015 foundational problems across different domains, from the roughly pre-multimodal era.
- Computer vision
  - Classification
  - Segmentation
  - Object detection
  - Image captioning
  - Scene recognition
  - Pose estimation
  - Optical flow estimation
  - Depth estimation
  - Face recognition
  - Visual tracking
  - Style transfer
  - Image generation
- Natural Language Processing
  - Machine translation
  - Part of speech tagging
  - Question answering
- Speech Processing
  - Speech recognition
  - Speaker identification
  - Emotion classification
- Time series
- Reinforcement Learning
[2] Popular datasets around 2015, separated by domain
- Classification
  - ImageNet (ILSVRC 2017) - 1.2M training, 1000 classes - https://www.image-net.org/challenges/LSVRC/2017/index.php
  - CIFAR-10/100 - 60K (32x32), 10/100 classes - https://www.cs.toronto.edu/~kriz/cifar.html
  - MNIST - 70K handwritten digits - https://www.kaggle.com/datasets/hojjatk/mnist-dataset
  - Fashion-MNIST - 70K fashion items - https://github.com/zalandoresearch/fashion-mnist
  - SVHN - 600K real-world house numbers, 10 classes (one per digit) - http://ufldl.stanford.edu/housenumbers/
  - Caltech-101/256 - 9K/30K images, 101/256 categories - https://data.caltech.edu/records/mzrjq-6wc02, https://data.caltech.edu/records/nyy15-4j048
  - Oxford Flowers 102 - 102 categories - https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
  - Oxford-IIIT Pets - 7.4K images, 37 pet breeds - https://www.robots.ox.ac.uk/~vgg/data/pets/
  - Stanford Cars - 16K images, 196 car models - https://www.kaggle.com/datasets/eduardo4jesus/stanford-cars-dataset
  - FGVC Aircraft - 10.2K images, 100 aircraft variants - https://www.robots.ox.ac.uk/~vgg/data/fgvc-aircraft/
  - Food-101 - 101 food categories - https://www.kaggle.com/datasets/dansbecker/food-101
  - CUB-200-2011 - 12K bird images, 200 species - https://www.vision.caltech.edu/datasets/cub_200_2011/
  - Stanford Dogs - 20K images, 120 dog breeds - http://vision.stanford.edu/aditya86/ImageNetDogs/
  - MIT Indoor Scenes - 15K images, 67 indoor categories - http://web.mit.edu/torralba/www/indoor.html
- Segmentation
  - PASCAL VOC 2012 - 11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
  - MS COCO - 328K images, 80 object classes, 91 stuff categories, 5 captions per image, 250k people with keypoints - https://cocodataset.org/
  - Cityscapes - 5K fine/25K coarse annotations, 8 classes - https://www.cityscapes-dataset.com/, https://www.cityscapes-dataset.com/dataset-overview/#class-definitions
  - ADE20K - 25K images, 150 classes - https://groups.csail.mit.edu/vision/datasets/ADE20K/
  - PASCAL Context - 10K images, 459 classes - https://cs.stanford.edu/~roozbeh/pascal-context/
  - SBD (Semantic Boundaries) - 11K images from PASCAL - https://paperswithcode.com/dataset/sbd
  - NYUDv2 - 1.4K RGB-D images - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
  - SUN RGB-D - 10K RGB-D images - https://rgbd.cs.princeton.edu/
  - KITTI Semantic - http://www.cvlibs.net/datasets/kitti/
- Object Detection
  - PASCAL VOC 2012 - 10K/11K images, 20 classes - http://host.robots.ox.ac.uk/pascal/VOC/
  - MS COCO - 328K images, 80 classes, 1.5M instances - https://cocodataset.org/
  - KITTI Object - http://www.cvlibs.net/datasets/kitti/
  - Open Images (v1 in 2016) - 15.8 images, 6000 classes - https://storage.googleapis.com/openimages/web/index.html
  - WIDER Face - 32K images, 393K face annotations - http://shuoyang1213.me/WIDERFACE/
- Other Tasks
  - Depth Estimation
    - NYUDv2 - 1.4K RGB-D scenes - https://cs.nyu.edu/~silberman/datasets/nyu_depth_v2.html
    - KITTI Depth - http://www.cvlibs.net/datasets/kitti/
    - Make3D - 534 images with depths - http://make3d.cs.cornell.edu/data.html
  - Optical Flow
    - Sintel - http://sintel.is.tue.mpg.de/
    - KITTI Flow - http://www.cvlibs.net/datasets/kitti/
    - Flying Chairs - 22K synthetic pairs - https://lmb.informatik.uni-freiburg.de/resources/datasets/FlyingChairs.en.html
    - Middlebury - Small but precise benchmark - https://vision.middlebury.edu/flow/
  - Pose Estimation
    - MPII Human Pose - 25K images, 40K people - http://human-pose.mpi-inf.mpg.de/
    - FLIC - 5003 images from movies - https://bensapp.github.io/flic-dataset.html
    - Leeds Sports Pose - https://www.kaggle.com/datasets/dkrivosic/leeds-sports-pose-lsp
  - Face Recognition
    - LFW (Labeled Faces in the Wild) - 13K images, 5.7K people - https://www.kaggle.com/datasets/jessicali9530/lfw-dataset
    - CelebA - 200K images, 10K identities - http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
    - MegaFace - 1M images, 690K identities - http://megaface.cs.washington.edu/
    - VGGFace - 2.6K people - https://www.robots.ox.ac.uk/~vgg/data/vgg_face/
  - Video/Action Recognition
    - UCF-101 - 13,320 videos, 101 actions - https://www.crcv.ucf.edu/data/UCF101.php
    - HMDB-51 - 6800 videos, 51 actions - https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/
    - Sports-1M - 1M YouTube videos, 487 sports - https://cs.stanford.edu/people/karpathy/deepvideo/
    - ActivityNet - 20K videos, 200 classes - http://activity-net.org/
  - Attributes/Multi-label
    - WIDER Attribute - http://mmlab.ie.cuhk.edu.hk/projects/WIDERAttribute.html
    - Berkeley Attributes - https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/poselets/
- Reinforcement Learning (you can think of dataset size as the number of rollouts)
  - Classic control tasks
    - OpenAI Gym (CartPole, MountainCar, Acrobot, etc.). I remember this from before ChatGPT lol, maybe I’m old
    - MuJoCo (Multi-Joint dynamics with Contact), like the HalfCheetah, Hopper, Humanoid, etc. This was typically done in a physics simulation and was popular for PPO.
  - Board games
    - Go
    - Chess
    - PyGame
  - TORCS
  - Minecraft
  - ViZDoom
  - Atari 2600 from DeepMind
[3] Scaling Laws Paper, Larger pretrained models paper
- "Scaling Laws for Neural Language Models" by Jared Kaplan et al. (2020): https://arxiv.org/abs/2001.08361
- "Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level": https://arxiv.org/abs/2105.06020
[4] Modern pretraining algorithms
- Pretext tasks
  - Rotation prediction
  - Jigsaw puzzles
  - Colorization
  - Inpainting/masked patches
- Contrastive learning methods
  - SimCLR (Chen et al., 2020): "A Simple Framework for Contrastive Learning of Visual Representations", arXiv:2002.05709
  - MoCo v1 & v2 (He et al., 2019/2020): "Momentum Contrast for Unsupervised Visual Representation Learning"; "Improved Baselines with Momentum Contrastive Learning", arXiv:2003.04297
  - BYOL (Grill et al., 2020): "Bootstrap Your Own Latent"
  - PIRL (Misra & van der Maaten, 2020): "Self-Supervised Learning of Pretext-Invariant Representations"
- Masked modeling
  - Masked Language Modeling (MLM): BERT (Devlin et al., 2018)
  - Masked Autoencoder (MAE)
- Multimodal learning
  - CLIP (Radford et al., 2021): "Learning Transferable Visual Models From Natural Language Supervision", arXiv:2103.00020
  - ALIGN (Jia et al., 2021)
  - DALL-E (Ramesh et al., 2021): "Zero-Shot Text-to-Image Generation"
[5] Pretraining datasets
- JFT-300M: Google’s internal dataset of 300M pseudo-labeled images: https://ar5iv.labs.arxiv.org/html/1707.02968 (TO VERIFY)
- LAION-5B: 5.85 billion (image, text) pairs scraped from Common Crawl
- CLIP Training Data: 400M (image, text) pairs https://arxiv.org/abs/2103.00020 (not released)
- Wikipedia: English 20GB
- Kinetics-700: 650k videos (technically has action classes but still used)
[6] Improving Language Understanding by Generative Pre-Training https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
[7] Video generation models as world simulators: https://openai.com/index/video-generation-models-as-world-simulators/
[8] DeepSeek LLM: Scaling Open-Source Language Models with Longtermism https://arxiv.org/abs/2401.02954
[9] Constitutional AI: Harmlessness from AI Feedback https://arxiv.org/abs/2212.08073
[10] Segment anything: https://arxiv.org/abs/2304.02643, SAM 2: Segment Anything In Images & Videos https://arxiv.org/pdf/2408.00714. More details below.
[11] https://techcrunch.com/2025/06/13/new-details-emerge-on-metas-14-3b-deal-for-scale/
[12] https://www.nature.com/articles/s41586-025-09227-0
[13] "Learning Transferable Visual Models From Natural Language Supervision" https://arxiv.org/abs/2103.00020
[14] "Zero-Shot Text-to-Image Generation" https://arxiv.org/abs/2102.12092
[15] "Emerging Properties in Self-Supervised Vision Transformers" https://arxiv.org/abs/2104.14294, "DINOv2: Learning Robust Visual Features without Supervision" https://arxiv.org/abs/2304.07193
[16] "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" https://arxiv.org/abs/1810.04805
[17] Waymo E2E Open dataset https://waymo.com/open/data/e2e#camera-data
