[Data Analysis Project] Can News Sentiment Predict Stocks? Building a System to Find Out (Part 1: Data & Sentiment Modeling)
- 오리 오리
- Jun 30

Introduction: The Invisible Hand, and The Invisible Sentiment
While Adam Smith's "Invisible Hand" explains market efficiency, John Maynard Keynes' "Animal Spirits" describes its irrational fluctuations. Modern financial markets are a space where these two forces coexist, and "Market Sentiment" (the collective psychology of investors) is widely regarded as a significant factor in price movements.
This project aims to quantify this unseen "Market Sentiment" using data science methods, and to statistically test its relationship with actual market indicators such as the KOSPI index. This first part details how unstructured text is translated into a quantifiable "Sentiment Index."
Methodology I: Constructing the Time-Series Data - The First Step to Finding Patterns in a Sea of Information
Economic/Mathematical Principle: The foundation of any Time-Series Analysis is data that is recorded consistently over time. Furthermore, to ensure statistical reliability, it is essential to minimize "Sampling Bias" by not relying on a single source of information.
Implementation Process:
Automated Data Pipeline: Based on this principle, I built an automated collection script using Google Apps Script. A time-driven trigger runs the script at a set time every day, fetching news headlines from the RSS feeds of multiple major economic news outlets (e.g., Yonhap News, Maeil Business). The script records a 'timestamp' and 'source' for each headline and appends the row to a "NewsRaw" sheet. The result is a consistent time-series database that serves as the foundation for all subsequent analysis.
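The collection step can be sketched in plain JavaScript, the language Apps Script is based on. This is a minimal sketch and not the script used in the project: the function names `extractHeadlines` and `toRawRows`, the three-column row layout, the sample feed, and the sample date are all assumptions made for illustration, and a regular expression stands in for a proper XML parser.

```javascript
// Hypothetical helper: pull the <title> of every <item> out of an RSS document.
// A regex keeps the sketch dependency-free; a real XML parser is more robust.
function extractHeadlines(rssXml) {
  const headlines = [];
  const itemPattern = /<item\b[^>]*>([\s\S]*?)<\/item>/g;
  let item;
  while ((item = itemPattern.exec(rssXml)) !== null) {
    const title = /<title\b[^>]*>([\s\S]*?)<\/title>/.exec(item[1]);
    if (!title) continue;
    // Unwrap an optional CDATA section and trim surrounding whitespace.
    const text = title[1]
      .replace(/^\s*<!\[CDATA\[([\s\S]*?)\]\]>\s*$/, "$1")
      .trim();
    if (text) headlines.push(text);
  }
  return headlines;
}

// Hypothetical helper: build [timestamp, source, headline] rows,
// the assumed column layout of the "NewsRaw" sheet.
function toRawRows(headlines, source, fetchedAt) {
  return headlines.map((headline) => [fetchedAt.toISOString(), source, headline]);
}

// Made-up feed content, used only to exercise the helpers.
const sampleFeed = `<rss><channel><title>Sample Feed</title>
<item><title>KOSPI closes higher</title></item>
<item><title><![CDATA[Exports surge in June]]></title></item>
</channel></rss>`;

const rows = toRawRows(
  extractHeadlines(sampleFeed),
  "SampleSource",
  new Date("2024-06-30T00:00:00Z")
);
console.log(rows);
```

Inside Apps Script, the feed text would typically come from `UrlFetchApp.fetch(url).getContentText()`, and each row could be written with the sheet's `appendRow` method from a function attached to a daily time-driven trigger.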
Methodology II: Sentiment Index Modeling - Designing the Rules to Translate Emotion into Numbers
Economic/Mathematical Principle: To convert unstructured text data into an analyzable variable, I developed a quantitative model to create a Proxy Variable for market sentiment, which I named the 'Sentiment Index.' The core of this model is to map text onto the real number line by viewing each keyword as a one-dimensional vector with a 'direction' (positive/negative) and a 'magnitude' (the score).
Implementation Process:
Weighted Lexicon: First, I constructed a domain-specific dictionary in a "Lexicon" sheet. To reflect the real-world economic principle that not all words have the same impact, I assigned differentiated weights to keywords. For example, a standard term like "rise" was scored as +10, while a more impactful term like "surge" received a higher weight of +20.
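The weighted lookup can be sketched as follows. Only the "rise" (+10) and "surge" (+20) weights come from the text above; the negative entries, the `scoreHeadline` name, and the choice of whole-word matching on English headlines are assumptions for illustration, not the contents of the actual "Lexicon" sheet.

```javascript
// Illustrative lexicon. "rise" and "surge" use the weights quoted in the article;
// "fall" and "plunge" are assumed mirror-image entries.
const LEXICON = new Map([
  ["rise", 10],
  ["surge", 20],
  ["fall", -10],
  ["plunge", -20],
]);

// Sum the weights of every lexicon keyword found in the headline.
// Words that are not in the lexicon contribute 0.
function scoreHeadline(headline, lexicon) {
  const words = headline.toLowerCase().match(/[a-z]+/g) || [];
  return words.reduce((total, word) => total + (lexicon.get(word) || 0), 0);
}

console.log(scoreHeadline("Chip stocks surge as exports rise", LEXICON)); // 30
console.log(scoreHeadline("Bond yields fall", LEXICON));                  // -10
console.log(scoreHeadline("Markets flat ahead of holiday", LEXICON));     // 0
```

Because each keyword carries its own weight, a headline containing "surge" moves the score twice as far as one containing "rise", which is the differentiated impact the lexicon is meant to capture.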
Contextual Negation Handling: To overcome the limitations of simple keyword matching, I added a rule to handle contexts like "no longer expect a rise." Using Apps Script, the rule checks whether a headline contains a negation word (e.g., "not," "no," "unable"). If one is found, the calculated score for that headline is multiplied by -1, reversing its polarity. In the example above, the +10 from "rise" becomes -10, so the headline is scored as negative rather than positive.
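A minimal sketch of the polarity flip described above, assuming whole-word matching on English headlines. The `applyNegation` name is hypothetical, and the negation list contains only the three examples given in the text.

```javascript
// Negation words taken from the examples in the article.
const NEGATIONS = new Set(["not", "no", "unable"]);

// Flip the sign of a headline's base score if any negation word is present.
function applyNegation(headline, baseScore) {
  const words = headline.toLowerCase().match(/[a-z]+/g) || [];
  const negated = words.some((word) => NEGATIONS.has(word));
  return negated ? -baseScore : baseScore;
}

console.log(applyNegation("Analysts no longer expect a rise", 10)); // -10
console.log(applyNegation("Analysts expect a rise", 10));           // 10
```

The flip applies to the whole headline, so it is a coarse rule: a headline whose negation word modifies something other than the scored keyword is reversed as well.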
Daily Index Calculation: The individual scores calculated for each news story were then aggregated into a single, representative daily indicator using a Pivot Table in the "DailySent" sheet. By setting 'Date' as the Row and the 'AVERAGE' of 'Final_Score' as the Value, I calculated the Arithmetic Mean for each day. This process consolidates hundreds of individual data points into a single representative value, our core metric: the "Daily Average Sentiment Index."
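The pivot-table aggregation is equivalent to grouping by date and taking the mean. A sketch under assumed names: `dailyAverage` and the `date` and `finalScore` fields are hypothetical, and the sample rows are made up rather than taken from the project's data.

```javascript
// Group scored headlines by date and return the arithmetic mean per day.
function dailyAverage(rows) {
  const totals = new Map(); // date -> { sum, count }
  for (const { date, finalScore } of rows) {
    const entry = totals.get(date) || { sum: 0, count: 0 };
    entry.sum += finalScore;
    entry.count += 1;
    totals.set(date, entry);
  }
  const averages = {};
  for (const [date, { sum, count }] of totals) {
    averages[date] = sum / count;
  }
  return averages;
}

// Made-up scores for two days.
const scored = [
  { date: "2024-06-28", finalScore: 10 },
  { date: "2024-06-28", finalScore: -10 },
  { date: "2024-06-28", finalScore: 30 },
  { date: "2024-06-29", finalScore: -20 },
  { date: "2024-06-29", finalScore: 0 },
];

const dailyIndex = dailyAverage(scored);
console.log(dailyIndex); // { '2024-06-28': 10, '2024-06-29': -10 }
```

Using the mean rather than the sum keeps days with many headlines comparable to days with few.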

Conclusion of Part 1: The Birth of a Quantified Sentiment
Through the process so far, I have transformed abstract, high-volume 'news' into a concrete and analyzable time-series dataset: the 'Daily Average Sentiment Index.' This is an attempt to combine a behavioral perspective from economics with the quantification methodology of data science.

