Hoseok Lee
Machine Learning Engineer
Summary
- Machine learning engineer with expertise in training, evaluating, and deploying deep learning models
with TensorFlow/Keras/PyTorch
- Proven track record in applying machine learning techniques to research and industrial domains
- Experienced in both classical machine learning (XGBoost, random forest) and deep learning
- Proficient in Python/R/SQL, with project experience in FastAPI/Django/HTML/CSS/JavaScript
Skills
Programming Languages
Python | R | SQL | C++ | HTML | CSS | JavaScript | Bash
Tools and Frameworks
AWS | GCP | FastAPI | Git | PySpark | pandas | Django | Tableau | MySQL
Machine Learning
TensorFlow | PyTorch | Keras | LangChain | XGBoost | scikit-learn | TensorBoard
Experience
Embedded Machine Learning Engineer
SEP-DEC 2023
ARCTURUS NETWORKS INC
- Automated a pill pouch conveyor system via heat-seal detection with a YOLOv5 model, achieving over
95% mAP at a 0.7 IoU threshold. Identified feasible image augmentations based on training metrics
from TensorBoard and Weights & Biases.
- Wrote a Bash command-line tool to prepare image data for augmentation and YOLOv5 training.
- Ensembled Neural Image Assessment (NIMA) and pose estimation models to predict aesthetic quality
scores (on a 1-5 scale) for a client’s photo archive as part of an auto-rating application.
Junior Data Scientist
JAN-AUG 2023
ALS GOLDSPOT DISCOVERIES LTD
- Retrieved and preprocessed rock core photos from AWS, then trained a PyTorch U-Net++ model to
generate segmentation masks. Applied perspective alignment with OpenCV, with outputs shown in a web
application to aid ALS Goldspot’s geologists in rock core analysis.
- Designed a custom ranking loss to modify an active learning approach, improving accuracy by 4%
over the reproduced results when training labels were limited (10-40% of CIFAR-10).
- Visualized performance across hyperparameter search in a GUI using Tkinter and Matplotlib.
Computer Vision Software Developer
MAY-AUG 2022
DEEP BREATHE
- Collaborated on the design of threshold-aware accumulative fine-tuning (TAAFT) to evaluate a lung sliding detector on external ultrasound scan data from multiple institutions.
- Stratified data by patient using SQL. Implemented the TAAFT data pipeline with TensorFlow/Keras, from generation to training/evaluation.
- Achieved 92% sensitivity and 82% specificity in detecting absent lung sliding.
Deep Learning Research Engineer
SEP-DEC 2021
HUAWEI TECHNOLOGIES CO LTD - WATERLOO DATA SECURITY AND PRIVACY PROTECTION LAB
- Devised a statistical method for PyTorch regression model watermark verification.
- Focused research on bridging the gap between classification watermarking schemes and their regression counterparts.
- Validated the regression watermarking scheme using Python and Bash scripts.
Biological Data Analysis Researcher
JAN-APR 2021
UNIVERSITY OF WATERLOO - DEPARTMENT OF APPLIED MATHEMATICS
- Laid the groundwork for comprehensive and precise eye-gaze analysis of patients diagnosed with ASD.
- Employed a fully convolutional network (FCN) architecture for semantic image segmentation.
- Implemented network analysis centrality measures to determine viewing patterns of individuals with ASD.
Publications
Wu, D.; Smith, D.; VanBerlo, B.; Roshankar, A.; Lee, H.; Li, B.; Ali, F.; Rahman, M.; Basmaji, J.;
Tschirhart, J.; et al. Improving the Generalizability and Performance of an Ultrasound Deep Learning Model
Using Limited Multicenter Data for Lung Sliding Artifact Identification. Diagnostics 2024, 14, 1081.
Selected Projects
Chatbot Web Application
In recent months, students have been using ChatGPT to assist with their studies (e.g. summarization,
alternative concept explanations). However, due to GPT's broad approach to text generation, AI
responses have been relatively unhelpful for highly context-specific user queries. Fine-tuning a GPT
model is one remedy but can be time-consuming; information retrieval offers an alternative.
In particular, vector-based retrieval has demonstrated powerful use cases for context-specific
text generation, especially when semantic meaning must be preserved. I'm currently building a chatbot
web application that pairs LangChain's retrieval-augmented generation (RAG) capabilities with the
Chroma vector database to act as an AI study partner that can handle advanced university
course material.
The application allows users to upload course notes (currently, only PDF files are accepted).
Each user's files are stored in a dedicated Google Cloud Storage bucket (a bucket-per-user
model), which enforces user isolation. Chat history is persisted in Upstash Redis, giving
users a more conversational experience with the app.
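A minimal sketch of how a bucket-per-user layout might derive bucket names, assuming a hypothetical `study-notes-` prefix and a hashed user ID (neither is taken from the project itself). It only builds names that satisfy Cloud Storage naming rules and makes no calls to Google Cloud.

```python
import hashlib
import re

# Hypothetical prefix; the application's actual naming scheme is an assumption here.
BUCKET_PREFIX = "study-notes-"

# Cloud Storage bucket names: 3-63 characters; lowercase letters, digits,
# dashes, underscores, and dots; must start and end with a letter or digit.
_VALID_BUCKET = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def bucket_name_for_user(user_id: str) -> str:
    """Map a user ID to a deterministic, valid bucket name."""
    if not user_id:
        raise ValueError("user_id must be non-empty")
    # Hash the ID so the bucket name does not expose emails or usernames.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:32]
    name = f"{BUCKET_PREFIX}{digest}"
    if not _VALID_BUCKET.match(name):
        raise ValueError(f"invalid bucket name: {name}")
    return name
```

Because the name is a pure function of the user ID, the same user always resolves to the same bucket, and no lookup table is needed to locate a user's uploads.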
The current version serves its user interface through FastAPI with Jinja2 templating. I
initially built a proof of concept in Streamlit, but quickly realized I needed to migrate
to gain more control over the UI.
Kaggle Classification Competition
For the course STAT 441 - Classification at Waterloo, the final project was a Kaggle
competition. The task was to predict, using statistical methods covered in the course, the religious
beliefs of European individuals on a scale of 1 to 5.
I led a team of three throughout the project. We began with exploratory data
analysis in Tableau to uncover feature correlations. Next, we applied feature engineering
using pandas and scikit-learn to a dataframe of 438 anonymous survey responses, which reduced
downstream feature selection time by over 80%.
Features were selected using XGBoost, and several models (e.g. XGBoost, random forest, kNN)
were stacked for prediction. We made several engineering improvements to our pipeline, including
model persistence, CUDA support, and training callbacks with TensorBoard integration.
Overall, we achieved 6th place out of 41 teams on the public test dataset.
Education
UNIVERSITY OF WATERLOO
Bachelor of Mathematics
Graduated: April 2024