Two University of Waterloo students earned $2.9 million in six months with their data labeling platform and secured $17.7 million (approximately 120 million RMB) in funding just over a year after founding the company.

Serena Ge and her co-founder
This is Datacurve, a young company aiming to challenge Scale AI.
The race for high-quality data has become the fiercest battlefield in AI, spawning companies like Scale AI, Turing, Surge, and Mercor. The undisputed unicorn in this field, Scale AI, carries a valuation exceeding $20 billion. Turing, which we featured earlier in our “AI Native 100” column, has a valuation of $2.2 billion.
A key differentiator of Datacurve from these data labeling companies — and the reason we’re highlighting it in the column — is its adoption of a “gamified labeling” approach.
It built a platform called Shipd, packaging medium-to-high difficulty programming challenges such as algorithmic problems, debugging tasks, and test cases into “Quests.” Engineers are invited to complete these quests for a clear cash reward, which they receive once their work is approved. The data verified by engineers is ultimately sold to AI companies or model labs for training and fine-tuning large language models (LLMs).
This “bounty hunter” model has helped Datacurve gain popularity. In October 2025, Datacurve announced the completion of a $15 million Series A round, bringing its total funding to $17.7 million. The round was led by Mark Goldberg of Chemistry, with employees from leading AI companies including DeepMind, Anthropic, and OpenAI also appearing on the investor list.
Huxiu spoke with industry investors about the business model of data labeling companies. For these firms, in addition to data quality, organizational management is crucial. The people responsible for labeling work in a “gig” capacity, so effective management and refined operations are essential to engage these gig workers in data labeling.
How Do You Engage Top Engineers With More Than Just Money?
Datacurve mentions on its official website that its Shipd platform has now attracted over 14,000 registered engineers to participate in tasks.
This figure raises a key question: why are so many mid-to-senior engineers willing to invest time and energy in what looks like data labeling work, when the pay is far lower than that of formal development jobs?
In a public interview, CEO Serena Ge provided the answer. She emphasized that money is not the strongest driver; what truly keeps engineers engaged is the sense of challenge, gamification, and participation experience offered by the platform. She defines Shipd as “a consumer product, not a data labeling operation” — a product for users to enjoy and experience, with money as just an added bonus.
To realize this vision, Datacurve has optimized the user experience and enhanced platform appeal in several ways:
- 1. Tasks with sufficient technical challenge: The platform has implemented multi-layer verification mechanisms, including automated testing, peer review, and expert audit, to ensure datasets meet research-grade standards. This design not only improves data quality but also raises the technical threshold for engineers, thereby boosting their motivation to solve the tasks.
- 2. Bounty hunter model and gamified structure: Tasks on Shipd are packaged as “Quests,” covering algorithmic challenges, debugging tasks, UI/UX generation, and more. The platform features leaderboards, streak rewards, and task levels, making participants both problem-solvers and competitors. These mechanisms transform tasks into technical “dungeons” rather than repetitive work orders, while providing engineers with quantifiable prestige accumulation.
- 3. Engineer-centric community culture: Shipd emphasizes an “engineer-first culture,” striving to create an ecosystem of belonging, recognition, and professional identity for high-skill participants — rather than merely a task distribution system.
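The multi-layer verification flow described in point 1 — automated testing, then peer review, then expert audit — can be sketched as a simple staged pipeline. This is a hypothetical illustration only: the class, function, and threshold names below are our assumptions, not Datacurve's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Submission:
    """A quest submission moving through three verification layers (illustrative)."""
    quest_id: str
    code: str
    passed_tests: bool = False                    # layer 1: automated testing
    peer_scores: list = field(default_factory=list)  # layer 2: peer review
    expert_approved: bool = False                 # layer 3: expert audit

def verify(sub: Submission, min_peer_score: float = 4.0) -> str:
    """Return the submission's status after each verification layer in order."""
    if not sub.passed_tests:
        return "rejected: failed automated tests"
    if not sub.peer_scores:
        return "pending: awaiting peer review"
    avg = sum(sub.peer_scores) / len(sub.peer_scores)
    if avg < min_peer_score:
        return "rejected: peer score too low"
    if not sub.expert_approved:
        return "pending: awaiting expert audit"
    return "approved: eligible for bounty payout"

# Example: a submission that cleared tests and peer review but awaits audit.
sub = Submission("quest-42", "def solve(): ...",
                 passed_tests=True, peer_scores=[4.5, 4.0, 5.0])
print(verify(sub))  # pending: awaiting expert audit
```

The point of the staging is that each layer is cheaper than the next: automated tests filter the bulk of failures before any human time is spent, and expert audit only sees work that peers have already rated highly.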

This “gamification + meritocracy” design sets Shipd apart from traditional platforms. It does not aim to engage everyone, but rather selects those capable of completing specific types of tasks. From the engineers’ perspective, this mechanism is fun, rewarding, and offers real financial benefits. From the platform’s perspective, it establishes a screening mechanism for data quality, forming a unique moat.
Shipd has become a hybrid product that bridges challenges, games, transactions, and knowledge production. Its success relies not on more people, but on stronger talent and higher-quality data.
Product Philosophy and Cold Start Process
Datacurve’s journey began with reverse-engineering demand.
Serena Ge previously interned at Cohere, participating in large language model training projects. Cohere is an AI technology company engaged in the development and commercialization of large language models and AI products, focusing on improving the reasoning and code generation capabilities of internal models. In contrast, Datacurve focuses on external data collection, aiming to build higher-quality, more challenging coding datasets. The inherent differences in their business natures make them natural upstream and downstream partners — a technical collaboration that extended to the capital level, with Cohere later becoming one of Datacurve’s early investors.
Serena’s internship at Cohere made her quickly realize a practical dilemma: while model capabilities are growing stronger, the supply of high-quality coding data remains a bottleneck. Traditional labeling methods cannot meet the complexity and professionalism required by models, and the missing data — like a blank puzzle piece — directly impacts model performance.
What if these missing data points were transformed into challenging problems, and data labeling was reimagined as a gamified platform that incentivizes engineers to contribute data?
Driven by this idea, Serena and Charley Lee tried building a simplified prototype and posting tasks in several technical communities to gather responses. They found that these test tasks quickly attracted a group of engineers interested in alternative programming challenges, and the feedback exceeded their expectations: participants not only completed tasks diligently but also offered improvement suggestions and requested leaderboard points.
This experiment opened the door for Datacurve to join Y Combinator, Silicon Valley’s largest startup accelerator. Datacurve was accepted into YC’s Winter 2024 batch, during which it developed the early version of the platform, refined the task review mechanism, and verified its appeal among engineers.

Serena has always maintained that “Shipd is a consumer product for engineers, not a data labeling operation.” The team invested heavily in optimizing the user experience, refining every detail to “make people want to come and stay.” Shortly after launching, the platform had paid out over $1 million in bounties, attracting senior engineers from companies like Amazon and AMD, who provided positive feedback.
After solidifying the two core links of data collection and community operation, Datacurve began advancing its commercialization path. In the early stages, it focused on collaborating with high-end AI labs and tool-based startups, including foundational model labs like OpenAI and Anthropic, as well as teams building intelligent coding tools for developers. Leveraging its early investor network and word-of-mouth, Datacurve gradually established channels to sell its high-quality data.
Founded in 2024, Datacurve completed its seed and Series A rounds in just over a year, raising a total of $17.7 million. It successfully created a closed loop: securing funding, engaging engineers, acquiring high-quality data, partnering with top clients, and then attracting more funding — presenting a clear growth path for a startup.
Who Owns the Code: Copyright Risks and Compliance Mechanisms
On Datacurve’s platform, every piece of code submitted by engineers is ultimately packaged into high-quality datasets and sold to AI companies. This naturally raises questions: Who owns the code? Is this data truly secure? This is not a problem unique to Datacurve, but a common challenge across the entire data labeling industry.
As a representative company in the data labeling field, Surge AI adopts a human-in-the-loop mechanism, where domain experts collaborate with customized models to complete labeling, ensuring outputs are not only accurate but also demonstrate strong contextual understanding. For sensitive or ambiguous tasks, the platform typically arranges multiple rounds of manual review to minimize deviations and misjudgments. On the compliance front, Surge AI provides auditable data processes, allowing clients to track and manage data usage paths. Despite this, Surge AI has experienced internal document leaks, indicating lingering regulatory and security vulnerabilities.
Datacurve’s solution involves multiple layers of protection in its mechanisms:
- 1. Contributor declaration: Every engineer must sign a commitment before submitting code, ensuring the content is original or that they have sufficient authorization to use it.
- 2. Automated plagiarism detection: The platform uses tools to automatically scan code content, identifying duplicates, tampering, or content from sensitive sources to prevent “content scrapers” from participating.
- 3. Consensus review mechanism: Solutions to a task are not only verified by the platform but also rated by multiple engineers. This peer review process not only improves solution quality but also acts as a safeguard against copyright risks with multiple sets of eyes.
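The three safeguards above can be thought of as a single intake gate that a submission must clear before its code enters a dataset. The sketch below is a toy illustration under stated assumptions — the function names, the token-overlap plagiarism heuristic, and the thresholds are all hypothetical, not Datacurve's actual tooling.

```python
# Toy model of the three compliance safeguards: contributor declaration,
# automated plagiarism scan, and peer consensus review. All names and
# thresholds here are illustrative assumptions.

def similarity_score(code: str, corpus: list) -> float:
    """Crude plagiarism check: best Jaccard overlap of tokens vs. known code."""
    tokens = set(code.split())
    best = 0.0
    for known in corpus:
        known_tokens = set(known.split())
        if tokens and known_tokens:
            best = max(best, len(tokens & known_tokens) / len(tokens | known_tokens))
    return best

def intake_gate(code: str, signed_declaration: bool, peer_votes: list,
                corpus: list, max_similarity: float = 0.8,
                min_consensus: float = 0.66) -> bool:
    """Return True only if a submission clears all three safeguards."""
    if not signed_declaration:                            # safeguard 1
        return False
    if similarity_score(code, corpus) > max_similarity:   # safeguard 2
        return False
    if not peer_votes:                                    # safeguard 3
        return False
    return sum(peer_votes) / len(peer_votes) >= min_consensus

corpus = ["def add(a, b): return a + b"]
ok = intake_gate("def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)",
                 signed_declaration=True, peer_votes=[True, True, False],
                 corpus=corpus)
print(ok)  # True: declared, low similarity, 2-of-3 peer consensus
```

A real system would of course use far stronger similarity detection (AST-level or embedding-based) and weighted reviewer reputations; the point is only that the safeguards compose as sequential filters.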
In addition, Datacurve controls task sources at the outset by prioritizing questions from controlled repositories, avoiding involvement with proprietary corporate code. Some tasks even require engineers to write code from scratch in a sandbox environment, prohibiting modifications to existing code.
From a legal perspective, Datacurve has also made clear distinctions. The platform uses “data contracts” and “license agreements” to define the scope of data use, ownership, and liability boundaries. Both clients and contributors must understand what they are submitting, purchasing, and agreeing to before collaborating.
Of course, no matter how robust the mechanisms, copyright risks cannot be ignored as data scales and circulates across institutions. Currently, Datacurve’s approach is more solid than that of traditional data platforms, but whether these risk mitigation measures can withstand complex future copyright claims remains to be tested in practice.
Asian Founders in the Data Labeling Field
The data labeling industry is home to a concentration of Asian founders.
Huxiu discussed this with industry investors, who suggested that data labeling is demanding work, and that the diligence characteristic of many Asian founders and teams may make them particularly well-suited to the industry.
From industry-leading unicorn Scale AI, to Mercor and Turing (which transformed expert networks into training factories), to emerging player Datacurve, we see a group of Asian faces from diverse cultural backgrounds. We’ve compiled some representative team members and their product directions below (based on public information).
| Company | Founded | Asian Core Team Members | Background (native data company / HR pivot) | Focus (refined high-end / one-stop platform) | Core Business | Core Competitiveness |
| --- | --- | --- | --- | --- | --- | --- |
| Scale AI | 2016 | Alexandr Wang (CEO), Lucy Guo (co-founder) | Native data company | One-stop platform | Data labeling, model evaluation, and platform capabilities | Deep focus on autonomous driving, generative AI, and defense verticals; operates a safety alignment lab |
| Turing | 2018 | Jonathan Siddharth (CEO), Vijay Krishnan (CTO) | HR company pivot | Refined high-end + integrated talent cloud | Talent cloud services and AI-driven matching; integrated training data and talent management; clients include OpenAI | AI-driven talent-matching technology; partnerships with top clients such as OpenAI |
| Mercor | 2023 | Adarsh Hiremath (CTO), Surya Midha (co-founder) | HR company pivot | Refined high-end | AI-interview screening of cross-domain talent; RLHF, SFT, and eval tasks | AI-driven talent matching; high compensation to attract top experts; long-term contracts with OpenAI, Anthropic, and others |
| Surge AI | 2020 | Edwin Chen (CEO) | Native data company | Refined high-end | High-quality data labeling, RLHF support, NLP and adversarial training | Strict quality-control processes, expert-level labeling teams, and modern API access |
| Datacurve | 2024 | Serena Ge (CEO), Charley Lee (CTO) | Native data company | Refined high-end | High-quality data labeling | “Bounty hunter” model attracting skilled software engineers; engineer-first philosophy; strict quality control |
When sorting through these data labeling companies, we also found that they generally fall into two categories: those transformed from human resources companies (such as Mercor and Turing) and those built as data-native companies from the start (such as Scale AI, Surge AI, and Datacurve).
Mercor initially started as an AI recruitment firm, matching technical talent through AI interview technology and building a high-quality expert talent pool. As demand for AI data labeling grew, Mercor quickly pivoted to providing data labeling services to AI labs, leveraging its accumulated expert resources in fields like medicine and law. This transformation turned it from a labor supplier for Scale AI into a direct competitor — particularly in RLHF and vertical domain labeling tasks, where Mercor has demonstrated strong competitiveness.
Turing followed a similar transformation path. Initially focused on remote engineer recruitment, Turing built a talent pool through its Talent Cloud model. As market demand evolved, Turing expanded into AI infrastructure services, extending its business from talent matching to code data labeling, model fine-tuning, and enterprise AI transformation consulting. It successfully transformed from a single talent service provider into an integrated platform for training data and talent management.
Datacurve faces significant competitive pressure, most directly from Surge AI — both companies pursue refined, high-quality data. While Datacurve’s bounty model seems innovative, the barrier to replication is not high. The true determinant of the platform’s moat lies in its ability to consistently produce data that enhances model performance, balance high quality with scalability, and maintain long-term engineer participation in the community.
However, Datacurve is not betting its future solely on engineer data. Founder Serena Ge has clearly stated that the platform’s mechanism has cross-industry migration capabilities, with potential future expansion into vertical professional fields such as finance, medicine, and marketing.