Learning with RAG Project
Interactive Learning Hub with RAG Chatbot
This project is designed to be a knowledge retrieval and Q&A system, leveraging Retrieval Augmented Generation (RAG) to provide users with an intuitive way to learn from documents and get answers to their specific questions. The platform allows for the display of educational materials (like structured content or, in the future, PDFs) and features an integrated chatbot (“學習小助手” - Learning Assistant) that users can interact with to clarify doubts or explore topics in more detail. ...
Simple RAG
Taiwan Travel RAG Q&A System
This project is a simple Retrieval Augmented Generation (RAG) application designed to answer questions about travel destinations in Taiwan. It leverages a knowledge base of Taiwanese cities and popular sightseeing spots to provide relevant information.
Source Code: https://github.com/chw18019/simpleRag.git
Features
- Ask questions in natural language about Taiwanese travel destinations.
- Get answers based on a curated knowledge base of travel information.
- Simple and intuitive web interface.
How it Works
The system follows a Retrieval Augmented Generation (RAG) approach: ...
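As a rough illustration of the retrieve-then-generate flow described above, here is a minimal sketch, assuming a sentence-transformers embedding model and a tiny in-memory knowledge base; the actual repository may use a different stack, and the entries and model name below are illustrative.

```python
# Minimal RAG sketch (not the repo's actual code): embed a small knowledge base,
# retrieve the most relevant entries for a question, and build a grounded prompt.
# The embedding model and knowledge-base entries here are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

knowledge_base = [
    "Taipei: visit Taipei 101 and the night markets around Shilin.",
    "Tainan: known for historic temples and Anping Old Fort.",
    "Hualien: gateway to Taroko Gorge and the east-coast cliffs.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
kb_vectors = model.encode(knowledge_base, normalize_embeddings=True)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Return the top_k knowledge-base entries most similar to the question."""
    q_vec = model.encode([question], normalize_embeddings=True)[0]
    scores = kb_vectors @ q_vec          # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [knowledge_base[i] for i in best]

def build_prompt(question: str) -> str:
    """Augment the question with retrieved context before calling an LLM."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What should I see in Hualien?"))
```

The generation step would pass the returned prompt to whatever LLM the application is configured with.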
Speaker Verification Project
Project Overview
Objective: To compare two audio recordings and determine whether they belong to the same speaker.
Use Cases
- Incoming calls: Compare with past speaker embeddings to reduce verification time.
- Outbound calls: Ensure correct speaker identity before proceeding with tasks.
Pipeline: Incoming Call → Extract Embeddings → Compare with Retrieved Embedding → Results (outbound calls enter the same pipeline at the embedding-extraction step).
Approach
A model was retrained on publicly available data downsampled to 8 kHz. The pipeline includes noise removal, voice activity detection (VAD), speaker embedding extraction, and similarity comparison. ...
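The comparison step boils down to scoring two embeddings against a threshold. A minimal sketch, assuming the denoising, VAD, and embedding extraction happen upstream; the threshold value and embedding dimension are illustrative, not the project's tuned settings.

```python
# Sketch of the verification step only: compare two speaker embeddings with
# cosine similarity and accept/reject against a tuned threshold.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_call: np.ndarray, emb_enrolled: np.ndarray,
                 threshold: float = 0.65) -> bool:
    """Return True if the two embeddings likely come from the same speaker."""
    return cosine_similarity(emb_call, emb_enrolled) >= threshold

# Example with placeholder 192-dim embeddings (real ones would come from the
# retrained 8 kHz model mentioned above).
rng = np.random.default_rng(0)
enrolled = rng.normal(size=192)
incoming = enrolled + rng.normal(scale=0.1, size=192)  # same speaker, slight variation
print(same_speaker(incoming, enrolled))                # True
```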
Speaker Verification Introduction
PCA StormVisualizer
Overview
The PCA StormVisualizer is an interactive web application designed for the exploration and analysis of weather storm data for the CT Hartford/Bradley (BDL) region. It allows users to filter storm events based on various meteorological criteria, visualize storm characteristics over time, and delve into storm typology using Principal Component Analysis (PCA) on features extracted via TsFresh.
Features
Interactive Filtering:
- Filter by Wind Speed (greater than a specified value in knots).
- Filter by Temperature (below a minimum or above a maximum °F).
- Filter by One-hour Precipitation (more than a specified value in inches).
- Adjust Cluster Count for storm data analysis (likely K-Means clustering).
Summary Statistics:
- Total Hours of Storms meeting criteria.
- Average Wind Speed.
- Average Temperature.
- Average Precipitation.
Data Visualization:
- Total Storm Hours over Years: A bar chart showing the total hours of storms meeting the filter criteria, aggregated (e.g., by Month, with a dropdown to change aggregation).
- Storm Lasting Hours over Years: A scatter plot illustrating the duration of individual storm events against their start time.
- Storm Clusters (PCA of TsFresh Features): A 2D scatter plot showing storm events projected onto the first two principal components (PC1 vs. PC2). Points are colored by cluster, revealing different storm types. Cluster centers are also marked. This visualization leverages features extracted from time-series data using the TsFresh library.
- Feature Loadings on Principal Components: A scatter plot displaying how the original TsFresh-extracted features contribute to the principal components, aiding in the interpretation of what defines each component and, consequently, the clusters.
Tabular Data:
- Storm Start Time and Lasting Hours Table: A paginated table providing a list of individual storm events with their start times and durations.
Data Download:
- Option to download the filtered storm data.
- Option to download the data from the Storm Start Time and Lasting Hours table.
Key Analyses Enabled
- Trend Analysis: Identify trends in storm frequency and duration over several years.
- Event Characterization: Understand typical and extreme values for wind speed, temperature, and precipitation during storm events.
- Storm Typology (Clustering & PCA): Discover distinct types of storm events based on a rich set of time-series features automatically extracted by TsFresh. Understand which meteorological characteristics (e.g., dwpf_minimum, relh_maximum, tmpf_root_mean_square) are most influential in differentiating these storm types by examining feature loadings on principal components (a minimal sketch of this step follows below).
Technologies
Based on the interface and functionality: ...
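To make the storm-typology step concrete, here is a hedged sketch of how TsFresh feature extraction, PCA, and K-Means could fit together; it is not the app's actual code, and the column names and input file are assumptions.

```python
# Sketch of the typology pipeline: extract TsFresh features per storm event,
# project them with PCA, cluster with K-Means, and inspect feature loadings.
import pandas as pd
from tsfresh import extract_features
from tsfresh.utilities.dataframe_functions import impute
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# long_df: one row per observation, with a storm identifier and a time index,
# e.g. columns ["storm_id", "time", "wind_speed", "tmpf", "relh", "dwpf", "precip"]
long_df = pd.read_csv("storm_observations.csv")  # hypothetical input file

features = extract_features(long_df, column_id="storm_id", column_sort="time")
impute(features)  # replace NaN/inf produced by some feature calculators

X = StandardScaler().fit_transform(features)

pca = PCA(n_components=2)
components = pca.fit_transform(X)   # PC1/PC2 coordinates for the scatter plot

kmeans = KMeans(n_clusters=4, random_state=0)  # cluster count would come from the UI slider
labels = kmeans.fit_predict(components)

# Feature loadings: how each TsFresh feature contributes to PC1 and PC2
loadings = pd.DataFrame(pca.components_.T, index=features.columns, columns=["PC1", "PC2"])
print(loadings.reindex(loadings["PC1"].abs().sort_values(ascending=False).index).head())
```

Plotting `components` colored by `labels`, plus the `loadings` table, reproduces the two PCA views described above.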
Potoo Solutions Capstone Project
This capstone project is guided by Professor Jennifer Eigo and Potoo CEO, Fred Dimyan. In this post, I’ll provide partial code to demonstrate some important techniques in Python for data analysis. The complete Jupyter notebook is available on my GitHub repository. Introduction “The era when warehouses and distribution centers stood as large, staid structures designed to simply meet the demand created by sales from America’s retailers, has evolved into a complex technological infrastructure servicing today’s rapidly expanding Ecommerce space” ― Marketing at Rakuten, 2019 ...
Data Visualization using Python
The complete Jupyter notebook is available on my GitHub repository. Introduction While I was the teaching assistant for Data Science using Python at UConn, I made these materials to help students get familiar with Python. I think they can be beneficial to others as well. This post is for general practice of exploratory data analysis (EDA) and data visualization using the classic Boston Housing Dataset. The following shows the packages included in this file: ...
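The actual package list is elided above and available in the notebook; as a rough idea of the kind of EDA setup the post walks through, here is a minimal sketch. The file name and column names below are illustrative assumptions, not the notebook's exact code.

```python
# Typical EDA/visualization setup for the Boston Housing data (illustrative only).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("boston_housing.csv")  # hypothetical local copy of the dataset

print(df.shape)
print(df.describe())  # summary statistics for each column

# Distribution of the target (median home value) and its relation to room count
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["MEDV"], bins=30, ax=axes[0])
sns.scatterplot(x="RM", y="MEDV", data=df, ax=axes[1])
plt.tight_layout()
plt.show()
```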
Online News Popularity Prediction - Part I
This is a school project that helped us get familiar with algorithms and analytical skills. The team used JMP software (from SAS) to do all the work. This post focuses on the methodology; for more details, see this whitepaper. Data Introduction The dataset is called the “Online News Popularity Data Set” and can be accessed from the UCI Machine Learning Repository. Each row represents a news article; the articles were collected between January 7, 2013 and January 7, 2015. There are 39,797 rows and 61 columns. ...
Online News Popularity Prediction - Part II
Following up on the last post, which covered how the analyses were conducted, this post uses Google Colab to perform the same analysis with Python, including data import, data pre-processing, and modeling. The complete Jupyter notebook is available on my GitHub repository. Data Import First of all, you need to mount your Google Drive, which authorizes Google Colab to access the files in your drive. ...
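Mounting the drive in Colab is a two-line step; the mount point below is Colab's conventional path, while the exact dataset path from the original notebook is not shown here.

```python
# Mount Google Drive in Colab so the notebook can read files stored there.
from google.colab import drive

drive.mount('/content/drive')  # prompts for authorization in the Colab UI

# After mounting, files are available under /content/drive/MyDrive/
```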
Web Scraping using Online News Dataset
When I worked on the Online News Popularity Prediction project, I wondered whether I could use additional types of data to predict popularity, such as titles and article contents. So I taught myself web scraping and found Scrapy, an open-source framework for extracting data from websites. For the complete jupyter notebook, you can find it on my GitHub repository. Introduction The history of web scraping dates back nearly to the birth of the Internet. It is not a new technique; the first crawler-based web search engine appeared in 1993. Nowadays, Python is one of the most popular languages for web scraping. This article lists popular Python web scraping frameworks and libraries. Scrapy is not the only one you can use, but I found it simple and fast for the job. ...
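To show what a Scrapy spider looks like, here is a minimal sketch; the start URL and CSS selectors are illustrative assumptions (news sites differ), not the project's actual spider.

```python
# Minimal Scrapy spider: follow article links from a listing page and yield
# the title and body text of each article.
import scrapy

class NewsSpider(scrapy.Spider):
    name = "news"
    start_urls = ["https://example.com/articles"]  # hypothetical listing page

    def parse(self, response):
        # Follow each article link found on the listing page
        for href in response.css("a.article-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "content": " ".join(response.css("div.article-body p::text").getall()),
        }

# Run with:  scrapy runspider news_spider.py -o articles.json
```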