3.3.4. Stemmer

The existing Arabic stemming tools can be categorized into two types: root-based stemmers and light stemmers. The first type extracts the root of the word, while light stemmers strip affixes (prefixes and suffixes). However, root-based stemmers are heavy stemmers and are known to have weaknesses such as increasing word ambiguity. Thus, in this work, we decided to use a light stemmer. However, most of the existing Arabic light stemmers can remove important parts of the word, leaving only a few letters with no meaning. Therefore, we used our developed Arabic light stemmer, the Iktishaf Stemmer [6]. It is designed to strip affixes based on the length of the word, which reduces the chance of stripping important letters and thereby changing the meaning or losing important words, particularly words related to transportation. Algorithm 2 shows the algorithm of the proposed stemmer. It takes the normalized tokens, the list of prefixes [P], and the list of suffixes [S] as input. We divide the prefixes into three lists, where each list contains prefixes that do not usually occur together in one word: P1 contains single-letter prefixes (conjunctions and prepositions), P2 contains the definite-article prefixes, and P3 contains compound prefixes. Prefixes in P1 and P2 are removed only if the token length is greater than 5, to decrease the chance of making mistakes and stripping letters that are actually part of the token. For instance, a five-letter city name that begins with the same letter as a P1 prefix is not affected by the stemmer because its length does not exceed the threshold. On the other hand, any prefix in P3 is stripped if the length of the word is greater than 4. Moreover, we take into consideration that a word may contain more than one prefix, so the first list contains the prefixes that come at the very beginning of the word. For instance, a word may start with a P1 prefix immediately followed by a P2 prefix, in which case the P2 prefix is stripped after the P1 prefix. A minimal sketch of this prefix logic is given below.
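To make the length guards concrete, the following Python sketch implements the prefix-stripping rules described above. The contents of P1, P2, and P3 are assumptions based on common Arabic proclitics (the source's exact lists were lost in extraction); only the length thresholds and the P1-before-P2 order come from the text.

```python
P1 = ["و", "ف", "ب", "ك"]           # assumed single-letter prefixes
P2 = ["ال", "لل"]                    # assumed definite-article prefixes
P3 = ["وال", "بال", "فال", "كال"]   # assumed compound prefixes

def strip_prefixes(token: str) -> str:
    # P1 prefixes come at the very beginning of a word, so they are
    # checked before P2 (e.g. a conjunction followed by the article).
    # P1 and P2 are stripped only from tokens longer than five letters.
    for p in P1:
        if len(token) > 5 and token.startswith(p):
            token = token[len(p):]
            break
    for p in P2:
        if len(token) > 5 and token.startswith(p):
            token = token[len(p):]
            break
    # P3 prefixes are stripped under the looser length > 4 guard.
    for p in P3:
        if len(token) > 4 and token.startswith(p):
            token = token[len(p):]
            break
    return token
```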

Similarly, for suffixes we have three lists. S1 contains attached-pronoun suffixes such as "هم" (their), while S2 contains longer verbal and plural endings such as "ون". To clarify, a verb can end with any one of the suffixes in these lists, and it may also contain two suffixes at once, as in a verb ending with "ونهم", which contains "ون" from S2 and "هم" from S1. In that case "هم" is removed first, because it is in S1, and then "ون" is removed in the next step.

After removing any suffix from the previous lists, we check the last letter of the word: if it now ends with 'ت', we replace it with 'ة'. For example, the word "سيارتهم" (their car) becomes "سيارت" after removing "هم", but the correct spelling is "سيارة", so we need to replace the final 'ت' with 'ة'.

Subsequently, the stemmer removes the suffixes in S3, which are the single letters 'ه' and 'ي', only if the length of the word is greater than five; thus, we reduce the chance of stripping them when they are part of the word itself, as in the previous example "سيارة": its final letter is not removed because the word consists of only 5 letters. After that, we again check whether the new stemmed token ends with 'ت' and replace it with 'ة'. For example, the words "سيارتي" (my car) and "سيارته" (his car) become "سيارت" after stemming, so we need to replace the last letter to make them "سيارة". The final suffix is 'ات', which is replaced with 'ة'. For instance, "سيارات" (cars) becomes "سيارة". Finally, we check the length of the word after stripping suffixes and prefixes and keep only words that have at least two characters. The sketch below mirrors these suffix rules.
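A companion Python sketch of the suffix side, under the same caveats: most S1 members and all of S2 are assumptions (only "هم" is explicit in the recoverable text), while the ت/ة and ات/ة replacements, the length guards, and the minimum-length filter follow the description above.

```python
S1 = ["هم", "ها", "كم", "نا"]  # attached pronouns; only "هم" is explicit in the text
S2 = ["ون", "ين", "ان"]         # assumed longer verbal/plural endings
S3 = ["ه", "ي"]                  # single-letter suffixes, stripped only when length > 5

def strip_suffixes(token: str) -> str:
    stripped = False
    # S1 endings sit at the very end of a word, so check S1 before S2
    # (e.g. "...ونهم" loses "هم" first, then "ون").
    for group in (S1, S2):
        for s in group:
            if token.endswith(s) and len(token) - len(s) >= 2:
                token = token[:-len(s)]
                stripped = True
                break
    # After a removal, a trailing 'ت' is restored to 'ة' (سيارت -> سيارة).
    if stripped and token.endswith("ت"):
        token = token[:-1] + "ة"
    # S3 suffixes are stripped only from words longer than five letters.
    for s in S3:
        if len(token) > 5 and token.endswith(s):
            token = token[:-1]
            if token.endswith("ت"):
                token = token[:-1] + "ة"
            break
    # The plural ending 'ات' is rewritten to 'ة' (سيارات -> سيارة).
    if token.endswith("ات"):
        token = token[:-2] + "ة"
    # Keep only stems of at least two characters.
    return token if len(token) >= 2 else ""

# e.g. strip_suffixes("سيارتهم"), strip_suffixes("سيارتي"),
# and strip_suffixes("سيارات") all yield "سيارة"
```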


### *3.4. Tweets Labeling Component (TLC)*

To generate a training set for the classifiers, we need labeled tweets. Since we have a very large dataset of around 33.5 million tweets, manual labeling would be a very expensive and time-consuming process. We manually labeled approximately twenty thousand tweets of the total 33.5 million and then combined them with tweets labeled using the automatic labeling approach. The manually labeled tweets also help us to generate dictionaries for automatic labeling, since it is a lexicon-based approach. The following subsections explain the proposed approach for labeling tweets for the event classifiers and for the tweet-filtering classifier. Even though we detect events after filtering tweets, we start with labeling events because the output will be used later to label the tweets as relevant or irrelevant.

### 3.4.1. Automatic Labeling for Event Tweets

### Creating Dictionaries

For each event type, we automatically generated a dictionary that contains the top frequent terms, using the manually labeled tweets. We then manually updated each dictionary to include missing terms related to each event type and to add synonyms. Both the manually and automatically added terms in the dictionaries are passed to the stemmer, since the search for matching terms is applied to tweets after pre-processing. We used the Iktishaf light stemmer (see Section 3.3.4). The sketch below illustrates the automatic step.
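A minimal Python sketch of extracting the top frequent terms for one event type from the manually labeled tweets. The `(tokens, label)` input format and the `top_k` cut-off are assumptions; the text does not state how many top terms are kept.

```python
from collections import Counter

def build_event_dictionary(labeled_tweets, event_type, top_k=100):
    # labeled_tweets: iterable of (tokens, label) pairs, where `tokens`
    # holds the pre-processed (normalized + stemmed) tokens of one
    # manually labeled tweet. `top_k` is an assumed cut-off.
    counts = Counter()
    for tokens, label in labeled_tweets:
        if label == event_type:
            counts.update(tokens)
    return [term for term, _ in counts.most_common(top_k)]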

The dictionaries contain a group of terms, but we cannot use them directly to search for matching tweets because the degree of relevance to the event type is not equal for all terms. Therefore, for each event type, we divided the terms into N lists. Each list is considered a level, so we have N levels (L1, L2, ..., Ln). Each term T in the event dictionary is assigned to a level based on the degree of importance of that term to the event. Thus, terms that are highly related to the event and appear in almost every report about it are assigned to the first level (L1), while the last level contains the terms least related to the event.

Furthermore, we gave each list a weight W based on the level it belongs to, which means we have N weights (W1, W2, ..., Wn). Wn is the highest weight, so it is assigned to L1, which contains the most important terms. In this work, we used 4 levels of terms. To clarify, for the Accident event, Level 1 includes terms such as accident (حادث) and crash (تصادم), Level 2 contains terms such as car (سيارة), driver (سائق), and road (طريق), Level 3 includes ambulance (إسعاف) and death (وفاة), while the last level (Level 4) contains the less important/relevant terms such as cause (سبب). A hypothetical encoding of these levels and weights is sketched below.
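The following sketch encodes the four-level Accident dictionary as a plain Python structure. The level terms are the stemmed forms of the examples above; the concrete weight values are assumptions, since the text only fixes their ordering (highest weight on L1).

```python
# Hypothetical leveled dictionary for the Accident event.
accident_dict = {
    1: ["حادث", "تصادم"],           # L1: accident, crash (present in almost every report)
    2: ["سيارة", "سائق", "طريق"],   # L2: car, driver, road
    3: ["اسعاف", "وفاة"],            # L3: ambulance, death
    4: ["سبب"],                       # L4: least related terms, e.g. cause
}
# Assumed weight values; only the ordering W1 < ... < Wn is given.
level_weights = {1: 4, 2: 3, 3: 2, 4: 1}
```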

Algorithm 3 shows the automatic labeling algorithm. It receives the pre-processed tweets (tweets\_P), the term dictionaries (terms\_D), and the event types (event\_T) as input and produces the labeled tweets as output. Apache Spark is used, and the pre-processed tweets are stored in a Spark DataFrame (tweets\_DF). For each token in a tweet, the algorithm searches for matching terms in the term dictionaries; the "Find Matching Terms" section explains this search. After that, a weight is calculated for each labeled tweet; the "Weight Calculation" section clarifies this process. The last step is sorting and filtering the labeled tweets based on the calculated weight; see the "Sort and Filter Automatic Labeled Tweets" section for further details. A PySpark sketch of this flow follows.
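A minimal PySpark sketch of the match-weight-sort-filter flow, reusing the hypothetical `accident_dict` and `level_weights` from the sketch above. It assumes `tweets_DF` is a DataFrame of pre-processed tweets with a `tokens` array column; the summed-weight aggregation is an assumption standing in for the exact formula of Algorithm 3.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("iktishaf-auto-labeling").getOrCreate()

def tweet_weight(tokens):
    # Sum the level weight of every dictionary term found among the tokens.
    weight = 0.0
    token_set = set(tokens or [])
    for level, terms in accident_dict.items():
        weight += sum(level_weights[level] for t in terms if t in token_set)
    return weight

weight_udf = udf(tweet_weight, DoubleType())

labeled_DF = (
    tweets_DF
    .withColumn("weight", weight_udf(col("tokens")))  # weight calculation
    .filter(col("weight") > 0)                        # keep matched tweets only
    .orderBy(col("weight").desc())                    # sort by relevance weight
)
```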

### Find Matching Terms

For each tweet, we applied the pre-processing steps explained in Section 3.3 to remove irrelevant characters, divide the text into tokens, normalize the tokens, remove the stop words, and apply the stemmer. The output is N clean, normalized, and stemmed tokens (K1, K2, ..., Kn). We then iterated over each tweet and, for each token K, searched for matching terms in the term levels (L1, L2, ..., Ln). The output is a list of the terms found at each level, as shown in Eq. (1), where $T_{x,i}$ denotes the $i$-th matching term found in level $L_x$ and $m_x$ is the number of matches at that level:

$$L_1 = [T_{1,1}\,,\; T_{1,2}\,,\; \dots\,,\; T_{1,m_1}], \;\dots,\; L_n = [T_{n,1}\,,\; T_{n,2}\,,\; \dots\,,\; T_{n,m_n}] \tag{1}$$
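A small Python sketch of producing the per-level match lists of Eq. (1), reusing the hypothetical `accident_dict` structure from earlier; the function name is illustrative.

```python
def find_matching_terms(tokens, event_dict):
    # For each level Lx, return the list of dictionary terms that
    # appear among the tweet's pre-processed tokens.
    matches = {}
    token_set = set(tokens)
    for level, terms in event_dict.items():
        matches[level] = [t for t in terms if t in token_set]
    return matches

# e.g. find_matching_terms(["حادث", "طريق"], accident_dict)
# -> {1: ["حادث"], 2: ["طريق"], 3: [], 4: []}
```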

