Architecting Data: A Programmer's Guide to Synthetic Data

Level:
intermediate
Duration:
180 minutes

Abstract

Finding good datasets to build data products with, or good web assets to build websites with, can be time-consuming. For instance, data professionals might require data from heavily regulated industries like healthcare and finance. In contrast, software developers might want to skip the tedious task of collecting images, text, and videos for a website. Luckily, both scenarios can now benefit from the same solution: synthetic data.

Synthetic data is artificially generated data created with machine learning models, algorithms, and simulations. This workshop shows you how to enter that synthetic world by building a full-stack tech product out of five interrelated projects: reproducible data pipelines, a dashboard, machine learning models, a web interface, and a documentation site. So, if you want to enhance your data projects or find great assets to build websites with, come and spend 3 fun and knowledge-rich hours in this workshop.

Tutorial~ None of these topics

Description

Audience

This tutorial is targeted at intermediate-level programmers looking to get started using synthetic data in their projects. The session will be particularly useful for data professionals, full-stack web developers, and educators searching for new ways to enhance their workflows and improve their projects.

Prerequisites

  • 1 year of programming experience with Python
  • Being comfortable with loops, functions, list comprehensions, and if-else statements.
  • At least 5 GB of free space on your computer.

Outline

Total time budgeted - 3 hours

  1. Introduction and Setup (~10 minutes)
    • Environment setup. An optional free-to-use environment will be provided in Binder, GitPod, Google Colab, and GitHub Codespaces
    • Agenda for the session
    • Instructors intro
    • Motivation for the workshop
  2. Section I - Building Blocks (~40 minutes)
    1. Introduction to Synthetic Data
      • What is it and why use it?
      • How to generate synthetic data with plain Python code?
      • Introduction to the different frameworks available
      • Creating a synthetic data generator module
      • Exercise (5 min)
    2. Analytics
      • Analysing and comparing real data vs synthetic data
      • Creating an analytical proof of concept product with synthetic data
      • Exercise (5 min)
  3. 10-minute break
  4. Section II - Engineering (~60 minutes)
    1. Data Engineering
      • Task - Create synthetic datasets and build ETL pipelines for different use cases
      • Synthetic Data Use Case - Generating data with errors to simulate how data professionals receive data in the real world
      • Exercise (5 min)
    2. Software Engineering
      • Task - Develop a simple website using Python frameworks such as FastAPI and Jinja templates
      • Synthetic Data Use Case - Generate website assets, including images, videos, and text
      • Exercise (5 min)
  5. 10-minute break
  6. Section III - Machine Learning (~30 minutes)
    • Quick intro to Machine Learning
    • Task - Create and evaluate different models and pipelines
    • Synthetic Data Use Cases
      1. Data Augmentation
      2. Increase in Privacy
      3. Evaluation of Machine Learning Models
    • Exercise (5 min)
  7. Concluding Thoughts
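
As a taste of the "plain Python" and "data with errors" items in the outline, here is a minimal sketch that uses only the standard library. It is not taken from the workshop materials: the function name `make_customers`, the field names, and the `error_rate` parameter are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical value pools; a real project would likely use richer sources.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev", "Elena"]
CITIES = ["Sydney", "London", "Lima", "Nairobi"]


def make_customers(n, error_rate=0.0, seed=None):
    """Return n synthetic customer records as a list of dicts.

    With probability error_rate, a record's age is replaced by None
    to mimic the missing values found in real-world data.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    records = []
    for customer_id in range(1, n + 1):
        record = {
            "id": customer_id,
            "name": rng.choice(FIRST_NAMES),
            "city": rng.choice(CITIES),
            "age": rng.randint(18, 90),
            "monthly_spend": round(rng.uniform(10, 500), 2),
        }
        # Always draw the random number so that the same seed yields the
        # same base records regardless of error_rate.
        if rng.random() < error_rate:
            record["age"] = None
        records.append(record)
    return records


clean = make_customers(100, seed=42)
messy = make_customers(100, error_rate=0.2, seed=42)
```

Because both calls share a seed, `messy` matches `clean` except for the ages that were blanked out, which makes it easy to practise cleaning steps against a known ground truth.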

The speaker

Ramon Perez

Hello! I'm Ramon, a data scientist, researcher, and educator living in Sydney. I currently work as a freelance data professional and was previously a Senior Product Developer at Decoded, a technology education company based in the UK. While at Decoded, I created custom data science tools, workshops, and training programs for clients in industries ranging from retail to finance. Prior to that, I held roles at the intersection of education, data science, and research in the areas of entrepreneurship and strategy, alongside a few ventures in consumer behavior and development economics research in industry and academia, respectively. On the personal side, I enjoy giving talks and technical workshops and have had the privilege of participating in several conferences such as PyCon, SciPy, and PyData, as well as countless meetup events. In my spare time, I spend as much time as possible mountain biking and exploring many of the outdoor wonders Australia has to offer.