Topic Modelling¶

Speeches by German Parliament (Bundestag)¶

Repository: github.com/raphaelw/nlp-bundestag

Project Overview (Blackbox)¶

Image

  • An algorithm works out topics
  • Topics are inspected and manually annotated

Explore Dataset¶

opendiscourse.de: Richter, F.; Koch, P.; Franke, O.; Kraus, J.; Kuruc, F.; Thiem, A.; Högerl, J.; Heine, S.; Schöps, K. (2020). Open Discourse. https://doi.org/10.7910/DVN/FIKIBO. Harvard Dataverse. V3.

Exploration: Text amount by parliamentary fraction¶

Image

"not found" are speeches by members of government (chancellor, ministers)

Exploration: Speech lengths¶

Image

Longest speech by the first minister of finance Fritz Schäffer. December 7, 1956. PDF, Page 2.

  • 160k characters, 21k Words, over 2 1/2 hours reading time (estimated).

Project Overview (Greybox)¶

Image

Preprocessing (spaCy, Gensim)

  • Lemmatization (better → good; walking → walk)
  • Remove stop words (the, of, and, ...)
  • Generate bigrams (word pairs)
  • Vectorization (bag-of-words)
  • Remove rare and common words

Tech Details¶

  • Number of topics specified: 50
  • Computation time:
    • Preprocessing: 10 hours
    • LDA: 7 hours

Tech Used¶

  • Hardware: Intel i7 (3rd Gen), 16 GB RAM
  • Scientific Python
    • JupyterLab, pandas, matplotlib
    • Apache Arrow (load serialized pandas DataFrames)
  • Natural Language Processing
    • spaCy (preprocessing)
    • Gensim (preprocessing and LDA)
  • Visualization
    • WordCloud

Results¶

Visit: github.com/raphaelw/nlp-bundestag/blob/main/topics.md

Topic 8: German reunification (Deutsche Wiedervereinigung)¶

Topic 8

Topic 9: Trade agreements (Handelsabkommen)¶

Topic 9

Topic 16: Family policy (Familienpolitik)¶

Topic 16

Topic 29: Military missions (Militäreinsätze)¶

Topic 29

jawohl - Yes, Sir!

Topic 34: Gender equality (Gleichberechtigung)¶

Topic 34

Final thoughts and room for improvement¶

  • Where are data protection / privacy speeches?
  • LDA Parameters
  • Try different number of topics
  • Improve lemmatization or use stemming
  • Split those long concatenated german words
  • Find stop words specific to this dataset

Thanks for listening!¶