We are a fast-growth, VC-backed, Series A Machine Learning start-up born from Cambridge, UK PhD research teams in AI and Speech. Our first Deep Learning engine is an ASR (Automatic Speech Recognition) engine that allows businesses to embed Speech-to-Text magic for 35 global languages in their software solutions. We value inclusion and privacy, and our customers get unparalleled accuracy across the widest range of cohorts without having to give up ownership of their data. Our customers are some of the biggest software companies on the planet.

*About the role:*

We’re looking for software or data engineers to help us build the next generation of speech-to-text ML systems by improving the scale, quality and breadth of our data. We are aiming to train our models on millions of hours of audio and terabytes of text, which will require an ambitious team to collect and manage our data.

This is an opportunity for you to take ownership of our data pipeline, which is a critical component in building state-of-the-art models to cement our position as the world’s leading speech-to-text solution. We’re looking for someone able to find creative ways to source new data at scale, improve the reliability of our systems, and design better abstractions for managing our data and analytics.

Experience working with machine learning data is desirable but not required. What we’re after is strong software and design skills and ethos.
Although the goal of this role is to support our machine learning operations, you’ll have to be self-directed and able to autonomously find important problems to address while working closely with our modelling and product teams towards a shared goal.

**Responsibilities:**

* Taking inventory of, understanding, and organising existing data, including its availability and usage, and obtaining additional metadata as needed
* Writing web scrapers to collect data in many languages
* Understanding where we are deficient in data and working out how that gap could be closed to widen support for things like accents, dialects and languages
* Supporting data sharing agreements with 3rd parties and managing data transfers between customers and Speechmatics
* Obtaining data for both testing and training across different use cases
* Identifying, coordinating and building out a network of 3rd-party vendors to support labelling in multiple languages as needed

**Essential experience:**

* Comfortable with Python, Shell scripting, and databases
* Design of robust automated pipelines for data acquisition and processing
* Data crawling and scraping from many diverse data sources
* Ability and motivation to dig into problems across a stack of unfamiliar code, whether it’s networking, infrastructure, or runtime performance of the code

**Desirable qualities:**

* You obsess over keeping things simple
* You are self-directed and like moving fast
* You are mission-orientated and value working on something bigger than yourself
* You dislike bureaucracy

**Benefits:**

* Pension matching
* Choice of Mac/Windows machine
* Hybrid working (remote)
* Free food & drinks
* Bonus scheme
* Private Medical Insurance
* Dental & Optical insurance
* Cycle scheme
* Life Assurance

**Interview process:**

* Telephone call
* Tech Test
* Video Call with Manager and Team
* Offer

Python, Shell, Data Architecture, Machine Learning
