In this research, we applied preprocessing techniques and extracted keywords using the chunking technique, which accepts textual data in its raw form.
3.3. Text Preprocessing
Text preprocessing is one of the most common tasks in natural language processing (NLP) applications; it transforms raw/unstructured text into a format that a machine can understand.
3.3.1. Removing Numbers, Punctuation
Numbers and punctuation usually do not convey value in text-analysis tasks. In reviews, numbers or digits typically denote an item’s price, the number of food items ordered, etc. For example, the following reviews contain numbers and punctuation marks.
Review Examples:
“Poor service, terrible food, overpriced. Why bother?”
“best pizza in the city! great staff! yay strada!”
“One of the best restaurants in the city; great cocktail list, also: Order the Max Valiquette!”
“Lobster Tuesday for 25 $, good linguini and lobster.”
“Service was very good, friendly. 3–5 over priced ($15) for rigatonni.”
In the above examples, numbers and punctuation marks such as ?, !, -, (, ), $, and ; are used. Such irrelevant data items affect the machine learning model and lead to poor results. Therefore, in sentiment-analysis tasks such as this one, we first remove all numbers, digits, and punctuation from the reviews so that the model can train on the actual data values.
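This cleaning step can be sketched in a few lines of Python (a minimal illustration, not the exact implementation used in this work):

```python
import re

def clean_review(text: str) -> str:
    """Remove digits and punctuation, collapse whitespace, lowercase."""
    text = re.sub(r"\d+", " ", text)      # drop numbers/digits
    text = re.sub(r"[^\w\s]", " ", text)  # drop punctuation marks
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_review("Lobster Tuesday for 25 $, good linguini and lobster."))
# lobster tuesday for good linguini and lobster
```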
Contractions are short forms of words (or groups of words) written with fewer letters and pronounced differently from the complete word(s). In most contractions, an apostrophe (’) represents the missing letters. Reviewers use contractions most of the time; they also reflect a particular person’s writing style, as a review is not required to strictly follow grammatical rules and formal word representation. The most common contractions consist of verbs, auxiliaries, or modals attached to other words. Some examples of reviews containing contractions are as follows:
Most bland veal sandwich I’ve ever had.
Shouldn’t have been surprised but the food quality wasn’t great and the service took forever.
Would’ve expected more for $19 but the quality is there.
Won’t be going here again.
Fantastic service. I can’t say a negative thing.
In the above review examples, the contractions used are: I’ve, Shouldn’t, Would’ve, Won’t, and can’t.
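Expanding contractions back to their full forms can be sketched as follows; the mapping dictionary here is a small hypothetical subset covering only the contractions seen in the sample reviews, whereas a real system would use a fuller dictionary or a dedicated library:

```python
import re

# Hypothetical mapping; covers only the contractions from the sample reviews.
CONTRACTIONS = {
    "i've": "i have", "shouldn't": "should not", "wasn't": "was not",
    "would've": "would have", "won't": "will not", "can't": "cannot",
}

def expand_contractions(text: str) -> str:
    text = text.replace("\u2019", "'").lower()  # normalize curly apostrophes
    # Longer keys first, so a longer contraction is never matched as a prefix.
    keys = sorted(CONTRACTIONS, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)

print(expand_contractions("Won't be going here again."))
# will not be going here again.
```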
3.3.2. Text Tokenization, Normalization, and Stop-Words
Text tokenization is the very first step in text preprocessing before applying any operation at the word level. It breaks text strings down into individual words and punctuation symbols at white-space characters. We had to conduct this process before feeding the data into the part-of-speech (POS) tagger. Afterward, we normalized the raw text into canonical form. A single English word like “connect” may have multiple forms: “connects”, “connected”, “connection”, “connecting”, “connectivity”, etc. Text normalization converts all such forms into their original or root word. We do this so that NLP methods can recognize words with similar meanings. Text is normalized using one of two approaches: stemming or lemmatization.
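A simplified, self-contained sketch of this tokenization step (the actual pipeline uses NLTK’s tokenizer, as described later):

```python
import re

def tokenize(text: str) -> list:
    """Split text into word tokens and individual punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("best pizza in the city!"))
# ['best', 'pizza', 'in', 'the', 'city', '!']
```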
3.3.3. Stemming and Lemmatization
Stemming and lemmatization are text-normalization techniques in NLP. A word usually has multiple meanings depending on its context in the text; in the same way, different words convey different meanings. We use different word forms in our sentences, based on grammar rules, to convey a complete and correct message. However, ML models in NLP do not work on such varied word forms; they treat multiple forms of a word as distinct entities, which causes larger storage and higher computation for no benefit. The ML model is also affected by such variations of words, so during data preprocessing we converted all the different forms of words into their root words.
In short reviews, users write words such as ‘liked’, ‘working’, ‘respectful’, and ‘disconnected’. These are extended forms of words; other variants include liking, likeness, worked, respected, respectfulness, disconnection, etc. Therefore, in NLP, we convert such words into their root forms, ‘like’, ‘work’, ‘respect’, and ‘disconnect’, using both stemming and lemmatization via SnowballStemmer and WordNetLemmatizer, respectively.
Although both techniques are used to obtain root words, they differ: stemming works by cutting off the end or beginning of the word, extracting the form common to all its variants as the final root word. Most of the time this works successfully, but not always. For example, for the words ‘study’, ‘studying’, ‘studies’, and ‘studied’, stemming would extract ‘stud’ as the root word, because the variation occurs after the letter ‘d’; but ‘stud’ is not the correct root word in this case. In contrast, lemmatization extracts the root word based on its dictionary meaning; for the same example, it extracts ‘study’. The drawback of lemmatization is that it is significantly slower than stemming, as it must look up each word in the dictionary to find the correct root.
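The contrast can be sketched as follows, using NLTK’s SnowballStemmer as in the study; the lemmatizer is replaced here by a toy dictionary lookup so the example stays self-contained (WordNetLemmatizer requires the WordNet corpus):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
variants = ["study", "studies", "studying", "studied"]
stems = [stemmer.stem(w) for w in variants]
print(stems)  # every variant collapses to the same (possibly non-word) stem

# A toy dictionary lookup standing in for WordNetLemmatizer: it maps each
# variant to its dictionary root instead of a truncated stem.
LEMMAS = {"studies": "study", "studying": "study", "studied": "study"}
lemmas = [LEMMAS.get(w, w) for w in variants]
print(lemmas)  # ['study', 'study', 'study', 'study']
```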
The existing techniques used for recommendation are collaborative filtering and content-based filtering. However, these approaches suffer from data sparsity, cold-start problems, and scalability issues. Our work revolves around graph-based and hybrid service recommendations. We obtain the hybrid service recommendation by combining these two approaches using a knowledge graph.
To better understand this recommendation process using the knowledge-graph framework, a flow diagram is given in Figure 4, which demonstrates the end-to-end flow of the recommendation framework.
3.4. DUSKG Framework
The DUSKG framework consists of three entities with five relations among them. The three entities are user, service, and value feature (VF). Value features are essentially the keywords extracted from the reviews. The user is the entity that reviews services and for whom the recommendations are generated. A review is the entity from which we extract value preferences to capture users’ interest in and taste for services; it tells us what recommendations are made and to whom. To extract VFs from reviews, the RAKE tool is used for keyword extraction, but the VFs it extracts are numerous, and some of them are not desired keywords. Rules are therefore defined as needed and applied to the extracted keywords to filter them as required. For this purpose, a new algorithm is developed that achieves new objectives in POS tagging beyond just nouns and verbs. Value preferences for the VFs of a service are calculated using the sentiment-analysis tool TextBlob, which gives the polarity of each keyword. The service is the entity on which the reviews are written and which is to be recommended to the target user.
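RAKE’s core idea, splitting text into candidate phrases at stop words and scoring each phrase by summed word degree over frequency, can be illustrated with a compact pure-Python sketch (this is not the actual RAKE tool, and the stop-word list is a tiny illustrative subset):

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "is", "and", "in", "of", "a", "very", "was"}  # toy list

def rake_candidates(text):
    """Split on stop words to obtain candidate keyword phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    return phrases

def rake_scores(text):
    """Score each candidate phrase by summed word degree / frequency."""
    phrases = rake_candidates(text)
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)  # degree counts co-occurring words (incl. itself)
    return {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}

print(rake_scores("The food quality is good. The service is great."))
# multi-word phrases like 'food quality' outscore single words
```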
Five relations among the entities were identified: FOCUSON, BELONGTO, USIMILAR, FSIMILAR, and SSIMILAR. The FOCUSON relation exists between user and VF entities; it tells us the aspects through which the user shows concern toward a service in the review. For example, if a user writes the review “Food is tasty but the wait time is long”, the user wants to say something about the tasty food and to complain about the long wait time. Thus, the relation between the user entity and the two VF entities (food and wait time) is FOCUSON, because the user’s focus in the review is on the food and the wait time. The BELONGTO relation exists between VF and service entities; it tells us that a particular VF belongs to that service, i.e., the observation was identified for that specific service. In the example given, the relation between those VF entities and the service for which the review is written is BELONGTO. USIMILAR is a relation between two user entities that captures the similarity between them. FSIMILAR is a relation between two VF entities; the word2vec tool is used to check the similarity between two VFs. SSIMILAR is a relation between two service entities. Weight vectors are associated with each relation, carrying the specific similarity values.
DUSKG can be defined as DUSKG = (E, R, TE, TR, fE, fR, A, fEA, fRA).
Here, E, R, TE, and TR represent the sets of entities, relations, entity types, and relation types, respectively, in the KG. The terms used in the DUSKG framework are explained below.
E = {User, VF, Service}
R = {FOCUSON, BELONGTO, USIMILAR, FSIMILAR, SSIMILAR}
R ⊆ {{u, v} | (u ∈ E) ∧ (v ∈ E)}: a relation can only exist between two entities u and v, and both entities must belong to the entity set E
fE ⊆ E × TE: a function that assigns a specific type to each entity in the entity set E
fR ⊆ R × TR: a function that assigns a specific type to each relation in the relation set R
A: the attributes of entities or relations with their corresponding values
fEA ⊆ E × A: a function that assigns attributes with values to an entity
fRA ⊆ R × A: a function that assigns attributes with values to a relation
The relations among entities in DUSKG are defined according to the rules given below:
fR(R) = FOCUSON if ((fE(E1) = User) ∧ (fE(E2) = VF))
fR(R) = BELONGTO if ((fE(E1) = VF) ∧ (fE(E2) = Service))
fR(R) = USIMILAR if ((fE(E1) = User) ∧ (fE(E2) = User))
fR(R) = FSIMILAR if ((fE(E1) = VF) ∧ (fE(E2) = VF))
fR(R) = SSIMILAR if ((fE(E1) = Service) ∧ (fE(E2) = Service))
In the above rules, E1 is the first entity and E2 is the second entity. The relation FOCUSON can only exist between a user entity and a VF entity. The relation BELONGTO can only exist between a VF entity and a service entity. The relations USIMILAR, FSIMILAR, and SSIMILAR can only exist between two user, two VF, and two service entities, respectively.
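These typing rules can be sketched as a simple lookup; the entity- and relation-type names below are taken directly from the definitions above:

```python
# Sketch of the DUSKG relation-typing rules: a relation type is valid
# only for the ordered entity-type pair given in the rules above.
RELATION_RULES = {
    "FOCUSON":  ("User", "VF"),
    "BELONGTO": ("VF", "Service"),
    "USIMILAR": ("User", "User"),
    "FSIMILAR": ("VF", "VF"),
    "SSIMILAR": ("Service", "Service"),
}

def relation_type(t1, t2):
    """Return the relation type allowed between entity types t1 and t2."""
    for rel, pair in RELATION_RULES.items():
        if pair == (t1, t2):
            return rel
    return None  # no rule allows this pair

print(relation_type("User", "VF"))       # FOCUSON
print(relation_type("VF", "Service"))    # BELONGTO
print(relation_type("Service", "User"))  # None
```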
The Yelp data of users, reviews, tips, restaurants, and check-ins are used as input to the whole recommendation process. In Figure 5, yellow-colored boxes represent entities, blue-colored text represents the relations among them, and green-colored boxes show our contribution to the recommendation process. The first step is to extract the restaurant-category data from the business dataset; based on these data, the data from the other files are extracted. In the second step, the required preprocessing is performed on the data, including tasks such as stop-word removal, tokenization, stemming, and lemmatization. For this purpose, we used the Natural Language Toolkit (NLTK) library in Python, which supports text processing, tokenization, parsing, classification, stemming, tagging, and semantic reasoning on textual data. The RAKE tool is used for mining features from user reviews, as discussed before, and sentiment analysis is then performed using the TextBlob toolkit discussed above. In the third step, the relationships among all the entities are confirmed; in other words, triples are created, which are then used to construct the knowledge graph, where each triple expresses one relation. After constructing the knowledge graph, a recommendation algorithm, specifically a hybrid algorithm, is applied in the fourth step. In the end, the model gives recommendations to the user based on the user’s interest and profile.
We did not perform data preprocessing for the chunking technique, because chunking works on grammar rules and sentence structure, including all the POS in their proper forms. Hence, preprocessing steps such as stop-word removal, lemmatization, and stemming need not be performed on the data.
There can be eight POS in a sentence or text: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. Each POS plays a specific role in the sentence structure. In sentiment-analysis tasks, the most important POS are the noun, adjective, and interjection. We applied chunking to the text with the following rules:
NP: {<NN.*><VB.*><JJ>*<RB>*<UH>*}
{<VB.*><NN><JJ>*<RB>*<UH>*}
{<NN.*><JJ.*>}
{<JJ.*><NN.*>}
{<NN.*><UH>}
{<UH><NN.*>}
{<FAC|ORG|LOC|PRODUCT|EVENT><JJ.*>}
{<JJ.*><FAC|ORG|LOC|PRODUCT|EVENT>}
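A reduced version of the above grammar can be run with NLTK’s RegexpParser. Here the input is hand-tagged so the example needs no tagger model, and the NER labels (FAC, ORG, LOC, PRODUCT, EVENT) are omitted because they are not ordinary Penn Treebank POS tags:

```python
import nltk

# Two of the patterns above: adjective+noun and noun+adjective.
grammar = r"""
NP: {<JJ.*><NN.*>}
    {<NN.*><JJ.*>}
"""
parser = nltk.RegexpParser(grammar)

# Hand-tagged tokens, so no tagger model is required.
tagged = [("the", "DT"), ("clean", "JJ"), ("environment", "NN"),
          ("is", "VBZ"), ("incredible", "JJ")]
tree = parser.parse(tagged)
chunks = [" ".join(word for word, tag in st.leaves())
          for st in tree.subtrees() if st.label() == "NP"]
print(chunks)  # ['clean environment']
```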
We observed from previous studies that noun phrases with adjectives are used in chunking rules to find polarities in reviews or text. The other parts of speech, such as pronouns, adverbs, and prepositions, do not play an important role in obtaining polarities or performing sentiment analysis. Noun (NN) words represent a person, place, thing, or idea in a sentence or review. A noun can be a subject or an object about which the user writes something useful, interesting, or informative in the review; this is what the user provides data about. For example, nouns can be a restaurant name, a food item, a location name, etc. Adjective (JJ) words describe a noun in a sentence; they represent the state of a noun, for example, ‘beautiful’, ‘fantastic’, ‘good’, ‘better’, ‘best’, ‘worst’, etc. A verb (VB) expresses the action being completed in the sentence; verbs in reviews can be ‘eat’, ‘drink’, ‘clean’, etc. Interjection words express emotions, for example, ‘Oh!’, ‘Wow!’, ‘Oops!’, etc. An adverb (RB) modifies or describes a verb, an adjective, or another adverb; adverbs can be ‘gently’, ‘extremely’, ‘carefully’, etc. Finally, the asterisk (*) is used to capture different forms of the POS words; e.g., nouns can be singular or plural, so with an asterisk both types are considered. Moreover, there may sometimes be a verb or adverb between a noun and an adjective, so we also used an asterisk to cover both cases.
These POS words are essential for obtaining polarity, semantics, or user opinions about a particular restaurant. Verbs are also helpful in the context of keyword similarity. For example, one user shares their experience at a restaurant and writes in the review, “I enjoyed my stay at the restaurant and especially the swimming in the pool”, while another user writes, “We eat healthy food and do swimming with our friends, great time, making fun”. The word “swimming” is used in both reviews, posted by different users, and represents this restaurant, so the two reviews can be considered similar based on the word “swimming” for that particular restaurant. We can then recommend such restaurants to users who like swimming or who wrote about swimming in their reviews.
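The similarity check behind this idea can be illustrated with cosine similarity over toy vectors standing in for word2vec embeddings (real word2vec vectors are learned from a corpus and have hundreds of dimensions; these three-dimensional values are invented for illustration):

```python
import math

# Toy 3-dimensional stand-ins for word2vec embeddings.
VECTORS = {
    "swimming": [0.9, 0.1, 0.2],
    "pool":     [0.8, 0.2, 0.3],
    "rigatoni": [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Related words score higher than unrelated ones.
print(cosine(VECTORS["swimming"], VECTORS["pool"]) >
      cosine(VECTORS["swimming"], VECTORS["rigatoni"]))  # True
```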
The above rules extract noun-phrase chunks from the text in which nouns mainly come with an adjective before or after them. In the second pattern, the verb comes first with nouns and adjectives, followed by the adverb or interjection as before. In the same way, the other rules capture noun first and adjective second, adjective first and noun second, noun first and interjection second, and interjection first and noun second. For now, we used three POS, noun, verb, and adjective, in the patterns, considering them essential for obtaining user opinions from reviews. Different POS, or different sequences, could also be used in the patterns.
We performed experiments to understand the workings of our keyword extraction method and evaluate its performance. We randomly selected 1000 reviews to compare the existing approach and our proposed method on keyword extraction. The results we achieved are the following.
3.7. VF2E vs. Keyword Extraction with Chunking
A further comparison between VF2E and the proposed approach concerns finding the polarity of keywords alone. We performed chunking on the reviews based on the grammar rules defined in our method, and calculated the polarity of the particular chunk extracted from the review rather than the polarity of every sentence in which that chunk appears. To clarify this point, consider an example of obtaining word polarity. If the phrase ‘very delicious’ were extracted as a keyword, its polarity would be positive. However, in reality, this phrase was used in a sentence like “The food was not very delicious”, so the polarity of the whole sentence is negative. When we passed this sentence to the chunking method, it gave us the essential chunk ‘food not delicious’.
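The difference between scoring the bare keyword and scoring the extracted chunk can be illustrated with a tiny hand-made polarity lexicon with simple negation handling (a stand-in for TextBlob, not its actual algorithm; lexicon values are invented):

```python
# Toy polarity lexicon and negation words, for illustration only.
LEXICON = {"delicious": 0.8, "good": 0.7, "great": 0.8, "terrible": -0.9}
NEGATIONS = {"not", "no", "never"}

def polarity(tokens):
    """Sum lexicon scores, flipping the sign after a negation word."""
    score, negate = 0.0, False
    for w in tokens:
        if w in NEGATIONS:
            negate = True
        elif w in LEXICON:
            score += -LEXICON[w] if negate else LEXICON[w]
            negate = False
    return score

print(polarity("very delicious".split()))                   # positive
print(polarity("the food was not very delicious".split()))  # negative
```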
Both methods aim to find the most relevant services/restaurants and recommend them to users. Generally, this process does not follow traditional machine learning practice, i.e., splitting the dataset into train and test parts, training the model on the train set, and then testing it on unseen test instances to evaluate its performance using classifiers/algorithms. The basic concept of both recommendation approaches is the same: we have no labeled training dataset, only customer reviews. This is precisely why we prepare our data and construct a particular structure through which we can recommend items to users based on the reviews they posted on a service. In our recommendation system, we attempt to recommend services to users without knowing which services they have visited before. The system is unaware of the services visited by the users; therefore, the proposed method recommends services without relying on a customer’s previously visited services, although we do have models, structures, and environments in which we process the data provided by the customers and then decide which services should be recommended to which user. The most important step is keyword extraction from the reviews; the keywords become entities, based on which relationships are created in the knowledge graph. When services must be recommended to a user, the algorithm crawls the graph to extract the best recommendations. It is at this keyword-extraction step that we compared the proposed method with the existing one. The existing method uses pruning rules to reject keywords; its main concern in keyword extraction is deciding which keywords should be rejected from the many extracted. RAKE is used in its extraction approach, and in VF2E [8] the authors present their approach as an improved RAKE method with a set of pruning rules, named value feature entity extraction (VF2E); even then, the number of keywords it extracts is considerable compared with our proposed method. As the authors have already noted, the keywords extracted by RAKE are large in number. After extracting keywords with RAKE, the authors of VF2E apply pruning rules; the three rules with which we are concerned are as follows:
Rule 1: Exclude VFs that consist of one word that is an adjective, adverb, or interjection.
Rule 2: Exclude VFs whose first letter is uppercase and whose named entity is identified as GPE (geopolitical entity) or PERSON.
Rule 3: Exclude VFs without a noun.
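A sketch of how these three pruning rules might be applied, assuming each candidate keyword arrives as a list of (word, POS tag) pairs with an optional NER label passed in directly (VF2E itself obtains tags and NER labels from its own tools):

```python
def keep_keyword(phrase, ner_label=None):
    """Apply the three VF2E pruning rules to a candidate keyword.

    phrase: list of (word, Penn Treebank POS tag) pairs.
    ner_label: hypothetical NER label for the phrase, e.g. "GPE" or "PERSON".
    """
    # Rule 1: reject single-word keywords that are adjectives, adverbs,
    # or interjections.
    if len(phrase) == 1 and phrase[0][1][:2] in {"JJ", "RB", "UH"}:
        return False
    # Rule 2: reject capitalized keywords identified as GPE or PERSON.
    if phrase[0][0][:1].isupper() and ner_label in {"GPE", "PERSON"}:
        return False
    # Rule 3: reject keywords without any noun.
    if not any(tag.startswith("NN") for _, tag in phrase):
        return False
    return True

print(keep_keyword([("delicious", "JJ")]))                     # False (Rule 1)
print(keep_keyword([("Toronto", "NNP")], ner_label="GPE"))     # False (Rule 2)
print(keep_keyword([("clean", "JJ"), ("environment", "NN")]))  # True
```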
In a critical analysis of all the rules mentioned above, we found that they accept only those keywords that contain at least one noun. It is thus evident that nouns are the most crucial entities in the graph: users post reviews about, or talk about, a restaurant in terms of these entities. However, under this rule we also obtain every noun word as a keyword in the results, with a minimum keyword length of two letters; this simple rule therefore yields a tremendous number of noun words as keywords. Rule 3 is a general rule that impacts all keywords: any keyword without a noun is discarded, so only keywords that include at least one noun are accepted. Therefore, if a keyword is not itself a noun, it can only be accepted if it is extracted together with a noun as one keyword. On the other side, under Rule 1, any keyword that consists of a single adjective, adverb, or interjection is also discarded. The result is that a noun together with any other word, apart from these single-word POS cases, is accepted. This leads to irrelevant and unexpected keywords. For example, in one review, the extracted keywords were:
Review: “The food quality is good. The service is great. Clean environment and the ambiance is incredible.”
Keywords extracted by VF2E [‘ambience’, ‘clean environment’, ‘service’, ‘food quality’].
Keywords extracted by the Chunking method [‘quality is good’, ‘service is great’, ‘clean environment’, ‘ambiance is incredible’].
The keywords ‘ambiance’ and ‘service’ convey little. VF2E accepts them according to its pruning rules just because they are single nouns, whereas the proposed method does not accept single nouns as keywords. In our rules, a noun must come with another word, most often an adjective before or after it, to convey proper positive or negative information about that noun, which may affect our recommendation process. The keyword ‘clean environment’ is acceptable because the noun ‘environment’ comes with ‘clean’; this is a correct keyword that our chunking method also extracted. The keyword ‘food quality’ likewise does not portray anything worthwhile, because we do not know what is said about the food quality: is it bad, worst, good, or best? We cannot get such information from this keyword. In our method, by contrast, the extracted keywords properly represent the state of the noun, e.g., ‘service is great’, ‘clean environment’, ‘ambiance is incredible’, and ‘quality is good’. These keywords are much more important and relevant and have a great impact on our recommendation process. Admittedly, a keyword like ‘quality is good’ has its own issue: the quality being referenced is unclear (food quality, staff quality, etc.), so such hidden information may cause drawbacks at the bottom level of the process. Nevertheless, the difference between the keywords extracted by VF2E and by our chunking method is prominent. Another point concerning our proposed method is that the competitor work applies POS tagging at the word level, although some words can take multiple forms in a review. For example, the word ‘clean’ is an adjective by itself, and in a sentence like “The environment is clean” it is used as an adjective; however, in a sentence like “They clean the environment daily”, the word ‘clean’ is used as a verb.
When we apply a POS tagging technique to a single word, it does not capture the context of that word in the sentence. Therefore, we applied POS tagging at the sentence level to obtain the contextual POS tag of each word.