We are working on a large on-device speech-to-text (STT) model project and are currently sourcing large-scale Manglish (Malay-English code-switched) speech and text datasets captured in Malaysia.
We are interested in exploring a partnership with your organization for this project. Specifically, we are looking for:
- Dataset size & delivery: ~10,000 total hours of speech data, delivered in 2,000-hour tranches.
- Domains: A variety of contexts such as casual conversations, customer service calls, field recordings, scripted / read speech, and social interactions.
- Annotations:
  - Token-level language tagging (Malay vs. English) with a clear tag-set definition.
  - Speaker metadata (consistent speaker IDs, diarization for multi-party recordings, speaker counts, and gender / age distribution and balance).
We would appreciate it if you could share:
- Whether you currently hold any Manglish speech / text data, and the domains covered.
- Your ability to supply data meeting the above specifications (either existing or via new collection).
- Indicative pricing per 2,000-hour tranche and your typical lead time for delivery.
- Any additional information on your annotation standards, QA process, and licensing terms.
We look forward to receiving these details so that we can proceed with the project.