Software Engineer Intern - Fuzzy Distinct (Data Platform)

Dataiku

Software Engineering
Paris, France
Posted on Thursday, January 4, 2024

Headquartered in New York City, Dataiku was founded in Paris in 2013 and achieved unicorn status in 2019. Now, more than 1,000 employees work across the globe, in our offices and remotely. Backed by a renowned set of investors and partners including CapitalG, Tiger Global, and ICONIQ Growth, we’ve set out to build the future of AI.

Internship goal

Augment Dataiku data preparation with a tool that can automatically merge nearly identical data records

Detailed description

Today, Dataiku boasts a robust data preparation framework that functions admirably to process a vast amount of data, helping users to have clean databases with the right data (and only the right data) inside them. However, we believe that with your help, we can take it a step further!

In a world where databases can be filled by real humans, data is not always clean. Errors can happen, typos can be made, and sometimes, you want to merge two database tables containing the same information, but not quite in the same format. “Dataiku”, “dataiku”, “data\niku” refer to the same company, but will be considered different entries in your database.

The goal of this internship is to improve the capabilities of our “distinct” processor to support fuzzy matching (that is, matching data that looks almost the same). The new processor will help clients clean up their databases, using algorithms like Levenshtein distance, Jaro-Winkler distance, n-grams, Jaccard similarity, or Metaphone to detect duplicated information and reduce it to a single row.
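
To make two of the measures named above concrete, here is a minimal Python sketch of Levenshtein distance and of Jaccard similarity over character n-grams. It is an illustration only, not Dataiku's implementation; the function names `levenshtein`, `char_ngrams`, and `jaccard` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


def char_ngrams(text: str, n: int = 2) -> set:
    """Set of all contiguous character sequences of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams count as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

With these definitions, “Dataiku” and “dataiku” are one edit apart, and so are “dataiku” and “data\niku”: a strict distinct sees three different values where a fuzzy one sees near-identical strings.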

During this internship, you will:

  • Get familiar with Dataiku and its data preparation recipes as well as database schemas.
  • Design a new component that uses industry-standard algorithms (such as Levenshtein distance, Jaro-Winkler distance, n-grams, Jaccard similarity, or Metaphone) to automatically detect duplicate data.
  • Develop the User Interface that helps users understand the clusters of data, to ensure they are not grouping too much or too little.
  • Celebrate and party because our beloved users will then be able to reduce their data overload!
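
As a rough sketch of what detecting duplicates and presenting them as clusters can look like, the Python snippet below groups values whose normalized forms are similar enough. It rests on several assumptions: the names `normalize`, `similar`, and `cluster` are hypothetical, the 0.85 threshold is arbitrary, and the standard library's `difflib.SequenceMatcher` ratio stands in for the similarity measures listed above.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    # Lowercase and drop all whitespace, so "data\niku" becomes "dataiku".
    return "".join(value.split()).lower()


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    # SequenceMatcher.ratio() returns a score in [0, 1]; 1.0 means identical.
    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return score >= threshold


def cluster(values, threshold: float = 0.85):
    """Group values transitively: if A matches B and B matches C,
    all three end up in the same cluster (union-find)."""
    parent = list(range(len(values)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive pairwise comparison: O(n^2), fine for a sketch, not for big tables.
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if similar(values[i], values[j], threshold):
                parent[find(j)] = find(i)

    groups = {}
    for index, value in enumerate(values):
        groups.setdefault(find(index), []).append(value)
    return list(groups.values())


print(cluster(["Dataiku", "dataiku", "data\niku", "Google"]))
```

Because matches are chained transitively, a low threshold can merge unrelated values through intermediate ones, while a high threshold leaves real duplicates apart. That trade-off is the "too much or too little" question the user interface is meant to help with.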

Stack

  • Python or Java for the backend side
  • JavaScript/Angular for the frontend part

About Dataiku

Dataiku is the platform for Everyday AI, systemizing the use of data for exceptional business results. By making the use of data and AI an everyday behavior, Dataiku unlocks the creativity within individual employees to power collective success at companies of all sizes and across all industries. Don’t get us wrong: we are a tech company building software. Our culture is even pretty geeky! But our driving force is and will always remain people, starting with ours. We consider our employees to be our most precious asset, and we are committed to ensuring that each of them gets the most rewarding, enjoyable, and memorable work experience with us. Fly over to Instagram to learn more about our #dataikulife.
Our practices are rooted in the idea that everyone should be treated with dignity, decency and fairness. Dataiku also believes that a diverse identity is a source of strength and allows us to optimize across the many dimensions that are needed for our success. Therefore, we are proud to be an equal opportunity employer. All employment practices are based on business needs, without regard to race, ethnicity, gender identity or expression, sexual orientation, religion, age, neurodiversity, disability status, citizenship, veteran status or any other aspect which makes an individual unique or protected by laws and regulations in the locations where we operate. This applies to all policies and procedures related to recruitment and hiring, compensation, benefits, performance, promotion and termination and all other conditions and terms of employment. If you need assistance or an accommodation, please contact us at: reasonable-accommodations@dataiku.com