Analysis of LEGIDA police reports. Part II: Word frequencies

In Part II of my series on the press releases published by the Leipzig police department (Polizeidirektion Leipzig) on the occasion of the demonstrations of the xenophobic LEGIDA movement, I show which words are used most frequently in these reports.


The word frequency count is restricted (almost) entirely to meaning-bearing words. So-called stop words were excluded from the analysis. The stop word list used is available for download via the following link.
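How df.words is built is not shown in the post. A minimal sketch of the counting step, assuming the concatenated reports are available as a character vector txt and the downloaded stop word list as a plain text file stopwords.txt (both placeholder names, not taken from the original post), might look like this:

# Sketch only: split the reports into lower-case words, drop stop words,
# count frequencies and keep the most frequent words for plotting.
# In a UTF-8 locale, [:alpha:] keeps German umlauts intact.
words <- tolower(unlist(strsplit(txt, "[^[:alpha:]]+")))
stops <- readLines("stopwords.txt", encoding = "UTF-8")  # placeholder file name
words <- words[nchar(words) > 1 & !(words %in% stops)]

freq <- sort(table(words), decreasing = TRUE)
df.words <- data.frame(interval = names(freq),
                       freq = as.numeric(freq),
                       percent = round(100 * as.numeric(freq) / sum(freq), 1))
df.words <- head(df.words, 20)  # keep e.g. the 20 most frequent words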

library(ggplot2)

plt.words <- ggplot(df.words, aes(interval, freq, fill = 500 - freq)) +
  geom_bar(stat="identity", position="dodge", width = 0.75) + 
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.words$freq)+10), breaks = seq(0, max(df.words$freq)+10, by = 20)) +
  scale_x_discrete('') +
  theme(legend.position="none") + 
  coord_flip() +
  ggtitle("Häufigste Wortnennungen") +
  geom_text(aes(label = paste0(percent, '%')), size = 3, fontface = 2, 
            hjust = -0.5, vjust = 0.2)

plt.words

[Figure: bar chart of the most frequent words ("Häufigste Wortnennungen")]

With the wordcloud package, the same counts can also be displayed as a word cloud.
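The vector txt.wc passed to wordcloud() is not constructed in the post either. Under the assumptions of the sketch above, it could simply be the stop-word-cleaned text (a hypothetical construction, not the original code):

# Hypothetical: one string containing all stop-word-cleaned words;
# wordcloud() tokenizes and counts them itself (this requires the tm package).
txt.wc <- paste(words, collapse = " ")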

colfunc <- colorRampPalette(c("blue", "red"))

set.seed(4)
par(mar = c(0, 0, 0, 0))
wordcloud::wordcloud(txt.wc, 
                     scale=c(3,.3),
                     min.freq=3,
                     max.words=150,
                     random.order=FALSE,
                     colors = colfunc(200))

[Figure: word cloud of the most frequent words in the police reports]

Looking at the word frequency counts, it becomes apparent that the police press releases are largely concerned with placing the events in time and space. On the one hand, times of day are reported very often (cf.). On the other hand, it can be seen that the Richard-Wagner-Platz, the Augustusplatz and Leipzig's main station (Hauptbahnhof) in particular are central locations of the LEGIDA demonstrations.

LEGIDA rallies: A heat map of day times using ggplot2

LEGIDA is a Leipzig-based offshoot of the right-wing and xenophobic PEGIDA movement.

Since January 2015, LEGIDA has held at least one rally per month against what they call the “Islamisation of the Western world”.

Usually, the Leipzig Police Department publishes online reports describing what happened at these rallies.

In my following blog posts I will show what kind of information can be derived from these reports and how this information can be visualized. My blog posts will have a technical rather than a political character.

Today, I'm going to show how to find information about the time of day the rallies took place and how to visualize these specifications of time using a pie chart.

In the first code chunk, we simply load the concatenated police reports as a character vector named txt. The text of the police reports is rather unstructured. However, specifications of time are consistently given in the same format: two digits, a colon, and two digits, e.g. '19:00' for 7 p.m. Thus, all specifications of time can be extracted with a simple regular expression. Since we are not particularly interested in to-the-minute specifications, we then save the hours as a numeric vector.
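The loading step itself is not shown here. As a minimal sketch, assuming the concatenated reports are stored in a plain text file (legida_reports.txt is a placeholder name, not the original data source), it could look like this:

# Sketch of the loading step; the file name is a placeholder.
# Read the concatenated reports and collapse them into a single string.
txt <- paste(readLines("legida_reports.txt", encoding = "UTF-8"), collapse = " ")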

time.str <- unlist(regmatches(txt, gregexpr("\\d{2}\\:\\d{2}", txt)))
time.str

[1] "19:00" "17:00" "21:00" "15:00" "18:00" "17:00" "19:00" "18:45"
[9] "18:30" "20:20" "17:44" "21:45" "18:00" "17:30" "19:00" "20:20"
[17] "21:30" "19:00" "20:00" "21:15" "21:30" "22:30" "19:15" "19:45"
[25] "21:00" "18:15" "16:15" "19:30" "20:15" "21:30" "22:00" "16:45"
[33] "19:00" "17:20" "18:00" "19:00" "20:15" "19:00" "21:15" "22:45"
[41] "19:10" "19:40" "21:00" "22:00" "19:30" "21:30" "23:00" "21:45"
[49] "17:30" "19:00" "21:00" "19:00" "20:00" "21:15" "21:45" "17:15"
[57] "17:00" "18:00" "19:10" "21:00" "17:00" "19:15" "20:40" "20:15"
[65] "21:45" "18:00" "22:00" "18:30" "18:00" "18:30" "18:00" "18:45"
[73] "19:50" "20:00" "20:20" "21:15" "21:00" "18:45" "19:00" "20:40"
[81] "20:00" "20:45" "21:15" "19:50" "20:40" "21:00" "20:00" "20:45"
[89] "21:00" "19:00" "20:00" "20:50" "21:45" "21:00" "21:30" "19:00"
[97] "20:00" "20:45" "21:20" "21:30" "19:00" "19:30" "20:20" "21:00"
[105] "21:35" "18:00" "19:00" "18:00" "20:50" "19:00" "20:00" "20:50"
[113] "20:50" "19:00" "19:50" "20:35" "20:55" "17:35" "18:35" "19:00"
[121] "20:45" "21:00" "18:00" "18:45" "19:00" "19:35" "20:20" "20:30"
[129] "20:40" "22:00" "18:00" "19:20" "02:30" "19:05" "21:10" "17:40"
[137] "18:45" "20:00" "21:20" "18:30" "19:20" "20:00" "21:45" "19:00"
[145] "19:30" "20:45" "21:45" "23:00"

time.str <- as.numeric(stringr::str_sub(time.str, 1, 2))

[1] 19 17 21 15 18 17 19 18 18 20 17 21 18 17 19 20 21 19 20 21 21 22 19
[24] 19 21 18 16 19 20 21 22 16 19 17 18 19 20 19 21 22 19 19 21 22 19 21
[47] 23 21 17 19 21 19 20 21 21 17 17 18 19 21 17 19 20 20 21 18 22 18 18
[70] 18 18 18 19 20 20 21 21 18 19 20 20 20 21 19 20 21 20 20 21 19 20 20
[93] 21 21 21 19 20 20 21 21 19 19 20 21 21 18 19 18 20 19 20 20 20 19 19
[116] 20 20 17 18 19 20 21 18 18 19 19 20 20 20 22 18 19 2 19 21 17 18 20
[139] 21 18 19 20 21 19 19 20 21 23

Since we are only interested in the time span between 12 p.m. and 12 a.m. (noon to midnight), we transform our numeric vector time.str into a vector of class factor containing only hours within this span. Afterwards, we tabulate this vector.

time.str <- factor(time.str, levels = c(13:24))
time.tab <- table(time.str)

In the next step, we create a table containing the proportion for each hour and save these values, converted to percentages, in a new vector named time.vecp.

time.tabp <- round(prop.table(table(time.str)), 2)
time.vecp <- as.numeric(as.character((time.tabp)))*100

[1] 0 0 1 1 7 15 24 23 22 4 1 0

Finally, we want to visualize our results. Since the face of a clock can be reproduced very well with a pie chart, we first create a data frame with twelve segments of the same size (time). To this data frame, we add two more variables: our proportional time vector (value) and the labels for drawing the clock (labs).

df <- data.frame(time = rep(1, 12),
                 value = time.vecp,
                 labs = 1:12)

The pie chart is plotted using ggplot2. The result is a kind of heat map visualizing the times of day at which the LEGIDA rallies usually take place.

library(ggplot2)

  ggplot(df, aes(x = "", y = time, fill = value)) +
    geom_bar(width = 1, stat = "identity", colour = "grey") +
    scale_y_continuous('', limits=c(0, 12), breaks = seq(1,12,1),
                       labels=df$labs) +
    scale_x_discrete('') +
    scale_fill_distiller('Percent', palette = 'Oranges', space = "Lab", direction = 1) +
    coord_polar(theta = "y", start = 0) +
    labs(title = "LEGIDA clock") +
    theme_minimal() +
    theme(axis.text = element_text(size = 18)) 

[Figure: "LEGIDA clock" pie chart]

Obviously, the rallies usually take place between 6 and 9 p.m.


Last update: 2016-08-30, after the 35th LEGIDA rally.

The Kanzlerduell 2013

I generated this "gradient word cloud" with the GNU R "qdap" package. The analysis of the "Kanzlerduell" TV debate between Angela Merkel and Peer Steinbrück shows, on the one hand, which meaning-bearing words were used most frequently (indicated by the size of the word) and, on the other hand, which of the two candidates for chancellor used a given word more often than the other: all words colored blue were used more often by Angela Merkel, all words colored red more often by Peer Steinbrück.

[Figure: gradient word cloud of the 2013 Kanzlerduell]
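The code behind the word cloud is not included in the post. A minimal sketch using qdap's gradient_cloud() is given below; the data frame duell with the columns dialogue (what was said) and speaker ("Merkel" / "Steinbrück") is purely hypothetical, and the color-to-speaker mapping may need rev.binary = TRUE depending on the order of the factor levels.

library(qdap)

# Sketch only, not the original code: 'duell' is a hypothetical data frame
# with one row per utterance (duell$dialogue = text, duell$speaker = speaker).
gradient_cloud(text.var = duell$dialogue,
               bigroup.var = duell$speaker,
               X = "blue", Y = "red",               # the two ends of the color gradient
               stopwords = tm::stopwords("german"), # drop German stop words
               min.freq = 3,
               title = "Kanzlerduell 2013")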