Hoseok Lee

Machine Learning Engineer

Summary

Machine learning engineer with expertise in training, evaluating, and deploying deep learning models with TensorFlow/Keras/PyTorch
Proven track record in applying machine learning techniques to research and industrial domains
Has experience with classical machine learning (XGBoost, random forest) and deep learning
Proficient in Python/R/SQL, with project experience in FastAPI/Django/HTML/CSS/JavaScript

Skills

Programming Languages
Python | R | SQL | C++ | HTML | CSS | JavaScript | MySQL | Bash

Experience

Embedded Machine Learning Engineer SEP-DEC 2023

ARCTURUS NETWORKS INC

Automated a pill pouch conveyor system via heat-seal detection with a YOLOv5 model, achieving over 95% mAP at a 0.7 IoU threshold. Identified feasible image augmentations based on training metrics from TensorBoard and Weights & Biases.
Wrote a Bash command-line tool to prepare image data for augmentation and YOLOv5 training.
Ensembled Neural Image Assessment (NIMA) and pose estimation models to predict aesthetic quality scores (on a 1-5 scale) for a client’s photo archive as part of an auto-rating application.

Junior Data Scientist JAN-AUG 2023

ALS GOLDSPOT DISCOVERIES LTD

Retrieved and preprocessed rock core photos from AWS and trained a PyTorch U-Net++ model to generate segmentation masks. Applied perspective alignment with OpenCV, with outputs shown in a web application to aid ALS Goldspot’s geologists in rock core analysis.
Designed a custom ranking loss to modify an active learning approach, improving accuracy by 4% over reproduced baseline results when training labels were limited (10-40% of CIFAR-10).
Visualized performance across hyperparameter search in a GUI using Tkinter and Matplotlib.

Computer Vision Software Developer MAY-AUG 2022

DEEP BREATHE

Collaborated on the design of threshold-aware accumulative fine-tuning (TAAFT) to evaluate a lung sliding detector on external ultrasound scan data from multiple institutions.
Stratified data by patient using SQL. Implemented the TAAFT data pipeline with TensorFlow/Keras, from generation to training/evaluation.
Achieved 92% sensitivity and 82% specificity in detecting absent lung sliding.

Deep Learning Research Engineer SEP-DEC 2021

HUAWEI TECHNOLOGIES CO LTD - WATERLOO DATA SECURITY AND PRIVACY PROTECTION LAB

Devised a statistical method for PyTorch regression model watermark verification
Focused research on bridging the gap between classification watermarking schemes and their regression counterparts
Validated regression watermarking scheme using Python and Bash scripts

Biological Data Analysis Researcher JAN-APR 2021

UNIVERSITY OF WATERLOO - DEPARTMENT OF APPLIED MATHEMATICS

Laid the groundwork for comprehensive and precise eye-gaze analysis for patients diagnosed with ASD
Employed a fully convolutional network (FCN) architecture for semantic image segmentation
Implemented network analysis centrality measures to determine viewing patterns of individuals with ASD

Publications

Wu, D.; Smith, D.; VanBerlo, B.; Roshankar, A.; Lee, H.; Li, B.; Ali, F.; Rahman, M.; Basmaji, J.; Tschirhart, J.; et al. Improving the Generalizability and Performance of an Ultrasound Deep Learning Model Using Limited Multicenter Data for Lung Sliding Artifact Identification. Diagnostics 2024, 14, 1081.

Selected Projects

Chatbot Web Application

In recent months, students have been using ChatGPT to assist with their studies (e.g. summarization, alternative concept explanations). However, because GPT generates text for general-purpose use, its responses are often unhelpful for highly context-specific queries. Information retrieval offers a solution that avoids fine-tuning a GPT model, which can be time-consuming.

In particular, vector-based retrieval has demonstrated powerful use cases for context-specific text generation, especially when semantic meaning must be preserved. I'm currently building a chatbot web application using LangChain's RAG (retrieval-augmented generation) capabilities, paired with Chroma's vector database, to act as an AI study partner that can handle advanced university course material.

The application allows users to upload course notes (currently, only PDF files are accepted). User files are isolated with a bucket-per-user model in Google Cloud Storage. Chat history is persisted in Upstash Redis, giving users a more conversational experience with the app.

My current version provides a user interface through FastAPI (with Jinja2 templating). I initially built a proof of concept with Streamlit, but quickly realized I needed to migrate away from it to customize the UI further.

Kaggle Classification Competition

For the course STAT 441 - Classification at Waterloo, the final project was a Kaggle competition. The task was to predict, using statistical methods covered in the course, the religious beliefs of European individuals on a scale of 1 to 5.

I led a team of three throughout the project. Our team performed exploratory data analysis using Tableau to uncover feature correlations. Next, we applied feature engineering using pandas and scikit-learn to a dataframe of 438 anonymous survey responses, which reduced downstream feature selection time by over 80%.

Features were selected using XGBoost, and several models (e.g. XGBoost, random forest, kNN) were stacked for prediction. We made several engineering improvements to our pipeline, including model persistence, CUDA support, and training callbacks with TensorBoard integration.

Overall, we achieved 6th place out of 41 teams on the public test dataset.

Education

UNIVERSITY OF WATERLOO
Bachelor of Mathematics
Graduated: April 2024