Assignment 8: Fake News - Programming for Data Science (Fall 2025)

Quick Facts

Create one notebook for this assignment, fake_or_real
Export it as MyST Markdown (by installing jupytext, which should include the front-end features for this)
Upload (or push) to a branch called assignment8
Open a PR

Use the dataset linked above to answer the following questions:

Is the text or the title of an article more predictive of whether it is real or fake?
Are titles of real or fake news more similar to one another?
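
One possible starting point for both questions is sketched below. It is not the required approach: the toy rows are made up, the column names `title`, `text`, and `label` are assumptions about the dataset, and the helper functions are hypothetical. It scores the same bag-of-words Naive Bayes model on each column, then compares the average pairwise cosine similarity of titles within each label.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy rows standing in for the real dataset; the column names
# "title", "text", and "label" are assumptions, not the dataset's schema.
news_df = pd.DataFrame({
    "title": [
        "Shocking secret cure doctors hate",
        "Shocking proof aliens built pyramids",
        "Shocking truth about celebrity clones",
        "Shocking miracle diet melts fat overnight",
        "Council report details transit budget",
        "Quarterly report shows steady job growth",
        "Health report tracks flu season",
        "Climate report reviews regional rainfall",
    ],
    "text": [
        "You will not believe this shocking secret that doctors hide from you",
        "Shocking photos prove that aliens secretly built the pyramids",
        "Insiders reveal the shocking truth that celebrities are secretly clones",
        "This shocking miracle diet secretly melts fat while you sleep",
        "The council published a report on Monday describing the transit budget",
        "A quarterly report published today shows steady growth in local jobs",
        "Officials published a health report describing this flu season",
        "Researchers published a climate report describing regional rainfall",
    ],
    "label": ["fake"] * 4 + ["real"] * 4,
})


def mean_cv_accuracy(documents, labels, folds=2):
    """Mean cross-validated accuracy of a bag-of-words Naive Bayes model."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    return cross_val_score(model, documents, labels, cv=folds).mean()


def mean_within_group_similarity(vectors):
    """Average cosine similarity over all distinct pairs of rows."""
    similarities = cosine_similarity(vectors)
    n_rows = similarities.shape[0]
    # Drop the diagonal, which only compares each title with itself.
    off_diagonal_total = similarities.sum() - np.trace(similarities)
    return off_diagonal_total / (n_rows * (n_rows - 1))


# Question 1: score the same model on each column.
title_accuracy = mean_cv_accuracy(news_df["title"], news_df["label"])
text_accuracy = mean_cv_accuracy(news_df["text"], news_df["label"])
print(f"title accuracy: {title_accuracy:.2f}, text accuracy: {text_accuracy:.2f}")

# Question 2: fit one vocabulary on all titles, then compare within each label.
title_vectors = TfidfVectorizer().fit_transform(news_df["title"])
similarity_by_label = {}
for label in ["fake", "real"]:
    row_ids = np.flatnonzero((news_df["label"] == label).to_numpy())
    similarity_by_label[label] = mean_within_group_similarity(title_vectors[row_ids])
print(similarity_by_label)
```

Fitting the vectorizer once on all titles keeps the vocabulary and weights shared across both groups, so the two similarity averages are measured in the same space. With only eight toy rows the numbers mean nothing; the real dataset and more folds would be needed before drawing conclusions.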

The data includes three columns:

Include narrative around the code required to answer the following questions and interpret the results to give an actual answer.

Provide context for your answer and consider how strong it is, given the different ways you could represent the data and how those choices might affect your model performance.
Consider whether the analysis you have completed is enough evidence to answer the question, or whether something else could change the answer.
Use summary statistics and visualizations appropriately in order to explain your results.
To earn compare, you can compare classifiers or representations, but make sure it is an appropriate comparison.
If you have questions about how to work on any specific achievement, post an issue on the repo.
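
As an illustration of comparing representations with summary statistics, the sketch below repeats a train/test split with several seeds and reports the mean and standard deviation of accuracy for raw counts versus TF-IDF. The titles are made up, the helper `accuracy_over_splits` is hypothetical, and the choices of classifier, split size, and number of seeds are assumptions rather than requirements.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up titles standing in for the real dataset.
titles = [
    "Shocking secret cure doctors hate",
    "Shocking proof aliens built pyramids",
    "Shocking truth about celebrity clones",
    "Shocking miracle diet melts fat overnight",
    "Council report details transit budget",
    "Quarterly report shows steady job growth",
    "Health report tracks flu season",
    "Climate report reviews regional rainfall",
]
labels = ["fake"] * 4 + ["real"] * 4


def accuracy_over_splits(vectorizer_class, documents, labels, n_splits=10):
    """Test accuracy for one representation across several random splits."""
    scores = []
    for seed in range(n_splits):
        # The same seeds are used for every representation, so each one
        # is evaluated on identical train/test partitions.
        train_docs, test_docs, train_labels, test_labels = train_test_split(
            documents, labels, test_size=0.25, stratify=labels, random_state=seed
        )
        model = make_pipeline(vectorizer_class(), MultinomialNB())
        model.fit(train_docs, train_labels)
        scores.append(model.score(test_docs, test_labels))
    return np.array(scores)


representations = {"counts": CountVectorizer, "tf-idf": TfidfVectorizer}
results = {
    name: accuracy_over_splits(vectorizer_class, titles, labels)
    for name, vectorizer_class in representations.items()
}
for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.2f}")
```

Holding the classifier and the splits fixed while changing only the representation is one way to keep the comparison appropriate: any difference in the scores can then be attributed to the representation. The spread across seeds also gives a rough sense of how sensitive the answer is to the split.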

Some example questions/additional analyses to earn innovative:

Deepen the analysis by breaking down the two questions above to make them more specific and detailed. For example: how much more predictive, how reliably more predictive, how sensitive is your answer to the train/test split, and is the more predictive model slower? If so, what would be the tradeoff point?
Stress test your model by generating additional fake news using an LLM and scraping additional, newer, high-quality news articles (if possible; some news agencies are mad at LLM training procedures and have locked their content down) or using the ones in the 20 newsgroups dataset. In other words, create a new test set using real articles from trusted venues and prompt an LLM to write fake articles.
Based on your analysis, how could you help teach a person to spot fake news? (hint: model inspection)
How could you get an LLM to generate news articles that your classifier thinks are real? (hint: model inspection)
How does it work if you use both the text and the title? Could you give people a simple flowchart that guides them to scan the title for things and then the article? How reliable can that be?
Apply clustering to the data and interpret what the results would mean. Do the clusters relate to the articles being real or fake, or to something else?
Add quantitative outcomes computed from the text and see if that value can be predicted from the titles (even something as simple as article length).
Could a better system be made using just the title first, then the text? What are the risks of that?