12. Assignment 12: Fake News

12.1. Quick Facts

12.3. Assessment

Eligible skills: (links to checklists)

  • first chance representation 1 and 2

  • first chance workflow 1 and 2

  • compare 1 and 2

  • optimization 1 and 2

  • clustering 1 and 2

  • classification 1 and 2

  • evaluate (must use extra metrics to earn this here) 1 and 2

  • summarize 1 and 2

  • visualize 1 and 2

12.4. Instructions

Use the dataset in the assignment template repo to answer the questions below. The data includes the following variables:

  • ‘text’: contents of an article

  • ‘label’: whether it is real or fake news

  • ‘title’: title of the article

  1. Is the text or the title of an article more predictive of whether it is real or fake?

  2. Are titles of real or fake news more similar to one another?
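One possible way to set up the first question is to train the same model once per column and compare cross-validated scores. The sketch below is only an illustration under several assumptions: the helper name `column_accuracy`, the choice of `CountVectorizer` with `MultinomialNB`, the label values `FAKE` and `REAL`, and the toy rows are all made up here to exercise the code, and they say nothing about what the real dataset will show.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def column_accuracy(df, column, label="label", cv=5):
    """Mean cross-validated accuracy of a bag-of-words classifier on one column.

    Hypothetical helper: the vectorizer and classifier are assumptions,
    not the required approach for the assignment.
    """
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    return cross_val_score(model, df[column], df[label], cv=cv).mean()


# Made-up toy rows, only so the sketch runs; swap in the real dataframe.
toy = pd.DataFrame({
    "title": ["shocking secret revealed"] * 10 + ["council publishes report"] * 10,
    "text": ["the article body goes here"] * 20,
    "label": ["FAKE"] * 10 + ["REAL"] * 10,
})

# Same model, same folds, different input column.
scores = {col: column_accuracy(toy, col) for col in ["title", "text"]}
print(scores)
```

Keeping the model and the cross-validation setup identical for both columns means that any difference in score can be attributed to the column rather than to the pipeline.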

Consider the different ways you could represent the data and how each might affect your model's performance. In particular, consider whether you have enough information to answer each question. Use summary statistics and visualizations as appropriate to explain your results.
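To make the effect of representation concrete, the following sketch computes one summary statistic, the mean pairwise cosine similarity of a few titles, under two different representations. It is a minimal illustration, not a prescribed method: the helper `mean_pairwise_similarity`, the two vectorizers chosen, and the example titles are assumptions introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def mean_pairwise_similarity(X):
    """Average cosine similarity over all distinct pairs of rows in X."""
    sim = cosine_similarity(X)
    upper = np.triu_indices_from(sim, k=1)  # each pair once, no self-pairs
    return sim[upper].mean()


# Made-up titles, only for illustration.
titles = [
    "the mayor opens the new bridge",
    "the mayor closes the old bridge",
    "the council debates the budget",
]

# The same documents, two representations, one summary statistic each.
results = {}
for name, vectorizer in [("counts", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    X = vectorizer.fit_transform(titles)
    results[name] = mean_pairwise_similarity(X)
print(results)
```

Because tf-idf down-weights words that appear in most documents, the two numbers will generally differ, which is one reason to state the representation alongside any similarity result.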

Hint

The dataset contains a large number of articles, so training takes a long time. You can downsample it to about 1,000 articles to speed up training and evaluation (hint: use shuffle).
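A minimal sketch of the downsampling step, assuming the hint refers to `sklearn.utils.shuffle` and that the data is loaded into a pandas DataFrame. The helper name `downsample`, the seed, and the toy rows (including the label values) are hypothetical.

```python
import pandas as pd
from sklearn.utils import shuffle


def downsample(df, n=1000, seed=0):
    """Return up to n randomly chosen rows of df with a fresh index."""
    n = min(n, len(df))  # do not ask for more rows than exist
    # shuffle permutes the rows; n_samples keeps only n of them
    return shuffle(df, random_state=seed, n_samples=n).reset_index(drop=True)


# Made-up stand-in for the real data; column names follow the assignment.
toy = pd.DataFrame({
    "title": [f"title {i}" for i in range(50)],
    "text": [f"body of article {i}" for i in range(50)],
    "label": ["REAL" if i % 2 == 0 else "FAKE" for i in range(50)],
})

small = downsample(toy, n=10)
print(small["label"].value_counts())
```

Fixing the seed keeps the subsample reproducible between runs, and checking the label counts afterwards confirms that both classes are still represented.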