# %% [markdown]
# # Job recommendation system
#
# The code sample contains the following parts:
#
# 1. Data exploration and visualization
# 2. Data cleaning/pre-processing
# 3. Fake job postings identification and removal
# 4. Job recommendation by showing the most similar job postings
#
# The scenario is that someone wants to find the best job posting for themselves. They have collected the data, but they are not sure whether all of the postings are real. Therefore, based on a trained model, as in this sample, they identify with a high degree of accuracy which postings are real, and choose the best ad for themselves from among those.
#
# For simplicity, only one dataset is used within this code sample, but the process would not differ significantly from the one described earlier.
#
# %% [markdown]
# ## Data exploration and visualization
#
# For the purpose of this code sample we will use the Real or Fake: Fake Job Postings dataset, available through the Hugging Face API. In this first part we focus on data exploration and visualization. In a standard end-to-end workload this is the first step: an engineer needs to know the data before working on it and preparing a solution that makes the best use of the dataset.
#
# Let's start by loading the dataset. We are using the `datasets` library to do that.
# %%
from datasets import load_dataset
dataset = load_dataset("victor/real-or-fake-fake-jobposting-prediction")
dataset = dataset['train']
# %% [markdown]
# To better analyze and understand the data we transfer it to a pandas DataFrame, so we can benefit from all of pandas' data transformations. The pandas library provides many useful functions for data manipulation, so it is a usual choice at this stage of a machine learning or deep learning project.
#
# %%
import pandas as pd
df = dataset.to_pandas()
# %% [markdown]
# Let's see the first and last 5 rows of the dataset we are working on.
# %%
df.head()
# %%
df.tail()
# %% [markdown]
# Now, let's print a concise summary of the dataset. This way we will see all the column names, the number of rows, and the type of every column. It is a great overview of the features of the dataset.
# %%
df.info()
# %% [markdown]
# At this point it is a good idea to make sure our dataset doesn't contain any duplicates that could skew the results of our future system. To do that we first need to remove the `job_id` column. It contains a unique number for each job posting, so even if the rest of the data is identical between two postings, this column makes them look different.
# %%
# Drop the 'job_id' column
df = df.drop(columns=['job_id'])
df.head()
# %% [markdown]
# And now, the actual duplicate removal. We first print the number of duplicates in our dataset, then remove them with the `drop_duplicates` method, and finally print the number of duplicates again. If everything works as expected, the last print should show `0` duplicates remaining in the dataset.
# %%
# let's make sure that there are no duplicated jobs
print(df.duplicated().sum())
df = df.drop_duplicates()
print(df.duplicated().sum())
# %% [markdown]
# Now we can visualize the data from the dataset. First let's visualize it as if it were all real; later, for the purposes of fake data detection, we will also visualize it split into real and fake postings.
#
# Text data can be challenging to visualize. Thankfully, there is a `wordcloud` library that shows the most common words in the analyzed texts: the bigger a word appears, the more often it occurs in the text. Word clouds allow us to quickly identify the most important topics and themes in a large text dataset and to explore patterns and trends in textual data.
#
# In our example, we will create a word cloud for the job titles, to get a high-level overview of the job postings we are working with.
# %%
from wordcloud import WordCloud # module to print word cloud
from matplotlib import pyplot as plt
import seaborn as sns
# Build a word cloud from the job titles
job_titles_text = ' '.join(df['title'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(job_titles_text)
# Plotting Word Cloud
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.title('Job Titles')
plt.axis('off')
plt.tight_layout()
plt.show()
# %% [markdown]
# Another way to extract information from this type of dataset is to show the top-n most common values in a given column, or the distribution of the values in that column.
# Let's show the top 10 most common job titles and compare the result with the word cloud shown previously.
# %%
# Get Count of job title
job_title_counts = df['title'].value_counts()
# Plotting a bar chart for the top 10 most common job titles
top_job_titles = job_title_counts.head(10)
plt.figure(figsize=(10, 6))
top_job_titles.sort_values().plot(kind='barh')
plt.title('Top 10 Most Common Job Titles')
plt.xlabel('Frequency')
plt.ylabel('Job Titles')
plt.show()
# %% [markdown]
# Now we can do the same for other columns, such as `employment_type`, `required_experience`, `telecommuting`, `has_company_logo` and `has_questions`. These should give us a really good overview of different parts of our dataset.
# %%
# Count the occurrences of each work type
work_type_counts = df['employment_type'].value_counts()
# Plotting the distribution of work types
plt.figure(figsize=(8, 6))
work_type_counts.sort_values().plot(kind='barh')
plt.title('Distribution of Work Types Offered by Jobs')
plt.xlabel('Frequency')
plt.ylabel('Work Types')
plt.show()
# %%
# Count the occurrences of required experience types
required_experience_counts = df['required_experience'].value_counts()
# Plotting the distribution of required experience
plt.figure(figsize=(8, 6))
required_experience_counts.sort_values().plot(kind='barh')
plt.title('Distribution of Required Experience by Jobs')
plt.xlabel('Frequency')
plt.ylabel('Required Experience')
plt.show()
# %% [markdown]
# For `employment_type` and `required_experience` we also create a matrix to see if there is any correlation between the two, and visualize it as a heatmap. If you suspect that some other parameters may be related, creating a similar heatmap is a good idea.
# %%
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
plt.subplots(figsize=(8, 8))
df_2dhist = pd.DataFrame({
    x_label: grp['required_experience'].value_counts()
    for x_label, grp in df.groupby('employment_type')
})
sns.heatmap(df_2dhist, cmap='viridis')
plt.xlabel('employment_type')
_ = plt.ylabel('required_experience')
# %%
# Count the occurrences of unique values in the 'telecommuting' column
telecommuting_counts = df['telecommuting'].value_counts()
plt.figure(figsize=(8, 6))
telecommuting_counts.sort_values().plot(kind='barh')
plt.title('Counts of telecommuting vs Non-telecommuting')
plt.xlabel('count')
plt.ylabel('telecommuting')
plt.show()
# %%
has_company_logo_counts = df['has_company_logo'].value_counts()
plt.figure(figsize=(8, 6))
has_company_logo_counts.sort_values().plot(kind='barh')
plt.ylabel('has_company_logo')
plt.xlabel('Count')
plt.title('Counts of With_Logo vs Without_Logo')
plt.show()
# %%
has_questions_counts = df['has_questions'].value_counts()
# Plot the counts
plt.figure(figsize=(8, 6))
has_questions_counts.sort_values().plot(kind='barh')
plt.ylabel('has_questions')
plt.xlabel('Count')
plt.title('Counts Questions vs NO_Questions')
plt.show()
# %% [markdown]
# From the job recommendation point of view, salary and location can be really important parameters to take into consideration. In the given dataset we have salary ranges available, so no additional processing is needed beyond removing empty ranges. If the dataset you're working on contains specific salary values instead, consider organizing them into appropriate ranges before displaying the result (a short sketch follows the plot below).
# %%
# Splitting salary ranges by comma and creating a list of range values
salary_ranges = df['salary_range'].str.split(',').explode()
salary_ranges = salary_ranges[salary_ranges != 'None']
salary_ranges = salary_ranges[salary_ranges != '0-0']
# Counting the occurrences of each salary range
salary_range_counts = salary_ranges.str.strip().value_counts()
# Plotting the top 10 most common salary ranges
top_salary_ranges = salary_range_counts.head(10)
plt.figure(figsize=(10, 6))
top_salary_ranges.sort_values().plot(kind='barh')
plt.title('Top 10 Salaries Range Offered by Companies')
plt.xlabel('Frequency')
plt.ylabel('Salary Range')
plt.show()
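# %% [markdown]
# A hedged sketch of the binning mentioned above, using made-up salary values (our dataset already provides ranges, so this is illustrative only):
# %%
# Bucket exact salary values into ranges with pd.cut before counting them
example_salaries = pd.Series([28000, 45000, 52000, 76000, 98000, 130000])
salary_buckets = pd.cut(
    example_salaries,
    bins=[0, 30000, 60000, 90000, 120000, float('inf')],
    labels=['0-30k', '30k-60k', '60k-90k', '90k-120k', '120k+'])
print(salary_buckets.value_counts())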
# %% [markdown]
# For the location we have country, state and city specified, so we need to split it into individual columns, and then show the top 10 countries and cities.
# %%
# Split the 'location' column into separate columns for country, state, and city
location_split = df['location'].str.split(', ', expand=True)
df['Country'] = location_split[0]
df['State'] = location_split[1]
df['City'] = location_split[2]
# %%
# Count the occurrences of unique values in the 'Country' column
Country_counts = df['Country'].value_counts()
# Select the top 10 most frequent occurrences
top_10_Country = Country_counts.head(10)
# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels
plt.figure(figsize=(14, 10))
sns.barplot(y=top_10_Country.index, x=top_10_Country.values)
plt.ylabel('Country')
plt.xlabel('Count')
plt.title('Top 10 Most Frequent Countries')
plt.show()
# %%
# Count the occurrences of unique values in the 'City' column
City_counts = df['City'].value_counts()
# Select the top 10 most frequent occurrences
top_10_City = City_counts.head(10)
# Plot the top 10 most frequent occurrences as horizontal bar plot with rotated labels
plt.figure(figsize=(14, 10))
sns.barplot(y=top_10_City.index, x=top_10_City.values)
plt.ylabel('City')
plt.xlabel('Count')
plt.title('Top 10 Most Frequent Cities')
plt.show()
# %% [markdown]
# ### Fake job postings data visualization
#
# What about the fraudulent class? Let's see how many of the jobs in the dataset are fake, and whether real and fake offers are roughly balanced or there is a significant disproportion between the two.
# %%
## fake job visualization
# Count the occurrences of unique values in the 'fraudulent' column
fraudulent_counts = df['fraudulent'].value_counts()
# Plot the counts using a rainbow color palette
plt.figure(figsize=(8, 6))
sns.barplot(x=fraudulent_counts.index, y=fraudulent_counts.values)
plt.xlabel('Fraudulent')
plt.ylabel('Count')
plt.title('Counts of Fraudulent vs Non-Fraudulent')
plt.show()
# %%
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='employment_type', hue='fraudulent')
plt.title('Count of Fraudulent Cases by Employment Type')
plt.xlabel('Employment Type')
plt.ylabel('Count')
plt.legend(title='Fraudulent')
plt.show()
# %%
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='required_experience', hue='fraudulent')
plt.title('Count of Fraudulent Cases by Required Experience')
plt.xlabel('Required Experience')
plt.ylabel('Count')
plt.legend(title='Fraudulent')
plt.show()
# %%
plt.figure(figsize=(30, 18))
sns.countplot(data=df, x='required_education', hue='fraudulent')
plt.title('Count of Fraudulent Cases by Required Education')
plt.xlabel('Required Education')
plt.ylabel('Count')
plt.legend(title='Fraudulent')
plt.show()
# %% [markdown]
# We can see no clear connection between those parameters and fake job postings, so we can drop them in further processing.
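# %% [markdown]
# As a quick, hedged numerical cross-check of the charts above, we can also look at the share of fraudulent postings per category:
# %%
# Fraction of fraudulent postings within each employment type and required experience level
print(df.groupby('employment_type')['fraudulent'].mean())
print(df.groupby('required_experience')['fraudulent'].mean())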
# %% [markdown]
# ## Data cleaning/pre-processing
#
# One of the most important steps in any type of data processing is data cleaning. For text it usually includes removal of stop words, special characters, numbers, and any additional noise such as hyperlinks.
#
# In our case, to prepare the data for fake job posting recognition, we will first combine all relevant columns into a single new record and then clean the resulting text.
# %%
# List of columns to concatenate
columns_to_concat = ['title', 'location', 'department', 'salary_range', 'company_profile',
'description', 'requirements', 'benefits', 'employment_type',
'required_experience', 'required_education', 'industry', 'function']
# Concatenate the values of specified columns into a new column 'job_posting'
df['job_posting'] = df[columns_to_concat].apply(lambda x: ' '.join(x.dropna().astype(str)), axis=1)
# Create a new DataFrame with columns 'job_posting' and 'fraudulent'
new_df = df[['job_posting', 'fraudulent']].copy()
# %%
new_df.head()
# %%
# import spacy
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
def preprocess_text(text):
    # Remove newlines, carriage returns, and tabs
    text = re.sub('\n', '', text)
    text = re.sub('\r', '', text)
    text = re.sub('\t', '', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    # Remove special characters
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits
    text = re.sub(r'\d', '', text)
    # Convert to lowercase
    text = text.lower()
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in text.split() if word.lower() not in stop_words]
    text = ' '.join(words)
    return text
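# %% [markdown]
# A quick, hedged check of the cleaning function on a made-up snippet (illustrative only): URLs, digits, special characters and stop words should be removed and the text lowercased.
# %%
print(preprocess_text("Visit https://example.com\nWe hire 10 Senior Engineers & Data Scientists!"))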
# %%
new_df['job_posting'] = new_df['job_posting'].apply(preprocess_text)
new_df.head()
# %% [markdown]
# The next step in the pre-processing is lemmatization. It is the process of reducing a word to its root form, called a lemma. For example, the verb 'planning' would be changed to its lemma 'plan'.
# %%
# Lemmatization
import en_core_web_sm
nlp = en_core_web_sm.load()
def lemmatize_text(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])
# %%
new_df['job_posting'] = new_df['job_posting'].apply(lemmatize_text)
new_df.head()
# %% [markdown]
# At this stage we can also visualize the data with word clouds based on the prepared text column, showing them separately for the real and fake postings.
# %%
from wordcloud import WordCloud
non_fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 0]['job_posting'])
fraudulent_text = ' '.join(text for text in new_df[new_df['fraudulent'] == 1]['job_posting'])
wordcloud_non_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(non_fraudulent_text)
wordcloud_fraudulent = WordCloud(width=800, height=400, background_color='white').generate(fraudulent_text)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
ax1.imshow(wordcloud_non_fraudulent, interpolation='bilinear')
ax1.axis('off')
ax1.set_title('Non-Fraudulent Job Postings')
ax2.imshow(wordcloud_fraudulent, interpolation='bilinear')
ax2.axis('off')
ax2.set_title('Fraudulent Job Postings')
plt.show()
# %% [markdown]
# ## Fake job postings identification and removal
#
# Unfortunately, not all job offers posted on popular portals are genuine; some of them are created only to collect personal data. Therefore, detecting fake job postings is essential.
#
# We will create a bidirectional LSTM model with one-hot encoding. Let's start with all the necessary imports.
# %%
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Dropout
# %% [markdown]
# Make sure you're using TensorFlow version 2.15.0.
# %%
import tensorflow as tf
tf.__version__
# %% [markdown]
# Now, let us import the Intel® Extension for TensorFlow*. We are using the Python API `itex.experimental_ops_override()`. It automatically replaces some TensorFlow operators with custom operators under the `itex.ops` namespace, while remaining compatible with existing trained parameters.
# %%
import numpy as np
import intel_extension_for_tensorflow as itex
itex.experimental_ops_override()
# %% [markdown]
# We need to prepare the data for the model we will create. First let's assign the job postings to X and the fraudulent values to y (the expected output).
# %%
X = new_df['job_posting']
y = new_df['fraudulent']
# %% [markdown]
# One-hot encoding is a technique to represent categorical variables as numerical values. Here, the Keras `one_hot` utility maps each word to an integer index within the chosen vocabulary size.
# %%
voc_size = 5000
onehot_repr = [one_hot(words, voc_size) for words in X]
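# %% [markdown]
# A quick, hedged illustration on a made-up sentence: the Keras `one_hot` utility hashes every word to an integer index in the range [1, voc_size), so repeated words map to the same index.
# %%
# Illustrative only - encode a short made-up phrase with the same vocabulary size
print(one_hot("senior software engineer senior", voc_size))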
# %%
sent_length = 40
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)
# %% [markdown]
# ### Creating model
#
# We are creating a deep neural network using a bidirectional LSTM. The architecture is as follows:
#
# * Embedding layer
# * Bidirectional LSTM layer
# * Dropout layer
# * Dense layer with sigmoid activation
#
# We are using the Adam optimizer with binary cross-entropy loss and tracking accuracy as the metric.
#
# If Intel® Extension for TensorFlow* backend is XPU, `tf.keras.layers.LSTM` will be replaced by `itex.ops.ItexLSTM`.
# %%
embedding_vector_features = 50
model_itex = Sequential()
model_itex.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
model_itex.add(Bidirectional(itex.ops.ItexLSTM(100)))
model_itex.add(Dropout(0.3))
model_itex.add(Dense(1, activation='sigmoid'))
model_itex.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_itex.summary())
# %%
X_final = np.array(embedded_docs)
y_final = np.array(y)
# %% [markdown]
# We split the data into training and test sets, keeping 25% of the samples for evaluation.
# %%
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.25, random_state=320)
# %% [markdown]
# Now, let's train the model. We are using the standard `model.fit()` method, providing the training and validation datasets. You can easily increase the number of epochs, but keep in mind that the model can become overtrained, giving very good results on the training data but poor results on the test data (see the early-stopping sketch after the training cell).
# %%
model_itex.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=1, batch_size=64)
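# %% [markdown]
# A hedged sketch for longer trainings: an `EarlyStopping` callback stops training once the validation loss stops improving. It is not used in the single-epoch run above; pass it as `model_itex.fit(..., callbacks=[early_stop])` if you increase the number of epochs.
# %%
from tensorflow.keras.callbacks import EarlyStopping
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)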
# %% [markdown]
# The values returned by the model are probabilities in the range [0, 1]. We need to map them to integer labels of 0 or 1.
# %%
y_pred = (model_itex.predict(X_test) > 0.5).astype("int32")
# %% [markdown]
# To demonstrate the effectiveness of our model we present the confusion matrix and classification report available in the `scikit-learn` library.
# %%
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(conf_matrix)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print("Classification report:")
print(class_report)
# %% [markdown]
# ## Job recommendation by showing the most similar ones
# %% [markdown]
# Now that we can be reasonably sure which postings are real, we can go back to the original columns and create our recommendation system.
#
# We will also use a much simpler approach for the recommendations. Even though we used deep learning to check whether a posting is fake, classical machine learning techniques are enough to find similar job postings.
#
# First, let's filter out the fake job postings.
# %%
real = df[df['fraudulent'] == 0]
real.head()
# %% [markdown]
# After that, we create a common column containing the text fields that we want to compare between postings and that are relevant when making recommendations.
# %%
cols = ['title', 'description', 'requirements', 'required_experience', 'required_education', 'industry']
real = real[cols]
real.head()
# %%
real = real.fillna(value='')
# Join the selected fields with spaces so words at field boundaries don't get merged
real['text'] = real['description'] + ' ' + real['requirements'] + ' ' + real['required_experience'] + ' ' + real['required_education'] + ' ' + real['industry']
real.head()
# %% [markdown]
# Let's see the mechanism that we will use to prepare recommendations - sentence similarity based on the prepared `text` column in our dataset.
# %%
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
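# %% [markdown]
# A quick, hedged sanity check (assuming the standard behaviour of `all-MiniLM-L6-v2`): the model returns fixed-size sentence embeddings that are L2-normalized, so the inner product of two embeddings equals their cosine similarity. This is why the code below can simply use `np.inner` as the similarity measure.
# %%
# Encode two short made-up phrases and compare them; the score is a similarity in [-1, 1]
example_embeddings = model.encode(["data scientist", "machine learning engineer"])
print(example_embeddings.shape)
print(np.inner(example_embeddings[0], example_embeddings[1]))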
# %% [markdown]
# Let's prepare a few example sentences covering 4 topics. It is easier to show how text similarity works on these short sentences than on the whole large dataset we have.
# %%
messages = [
# Smartphones
"I like my phone",
"My phone is not good.",
"Your cellphone looks great.",
# Weather
"Will it snow tomorrow?",
"Recently a lot of hurricanes have hit the US",
"Global warming is real",
# Food and health
"An apple a day, keeps the doctors away",
"Eating strawberries is healthy",
"Is paleo better than keto?",
# Asking about age
"How old are you?",
"what is your age?",
]
# %% [markdown]
# Now, we prepare functions to show the similarities between the given sentences in the form of a heatmap.
# %%
import numpy as np
import seaborn as sns
def plot_similarity(labels, features, rotation):
    corr = np.inner(features, features)
    sns.set(font_scale=1.2)
    g = sns.heatmap(
        corr,
        xticklabels=labels,
        yticklabels=labels,
        vmin=0,
        vmax=1,
        cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=rotation)
    g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
    message_embeddings_ = model.encode(messages_)
    plot_similarity(messages_, message_embeddings_, 90)
# %%
run_and_plot(messages)
# %% [markdown]
# Now, let's move back to our job postings dataset. First, we use the sentence encoding model to compute an embedding for every posting so we can calculate similarities (a hedged batching alternative is shown after the loop below).
# %%
encodings = []
for text in real['text']:
    encodings.append(model.encode(text))
real['encodings'] = encodings
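# %% [markdown]
# A hedged alternative to the per-row loop above: `model.encode` also accepts a list of texts and batches them internally, which is typically faster. Uncomment to use it instead of the loop:
# %%
# encodings = list(model.encode(real['text'].tolist(), batch_size=64, show_progress_bar=True))
# real['encodings'] = encodings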
# %% [markdown]
# Then, we can choose the job posting we want to calculate similarities to. In our case it is the first job posting in the dataset, but you can easily change it to any other posting by changing the value of the `index` variable.
# %%
index = 0
corr = np.inner(encodings[index], encodings)
real['corr_to_first'] = corr
# %% [markdown]
# Based on the calculated similarities, we can show the most similar job postings by sorting them according to the calculated correlation value.
# %%
real.sort_values(by=['corr_to_first'], ascending=False).head()
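# %% [markdown]
# Note that the first row of this ranking is the chosen posting itself (its correlation with itself is 1.0). A small tweak to list only the 5 recommendations is to skip that first row:
# %%
# Skip the query posting itself and show the top 5 most similar postings
real.sort_values(by=['corr_to_first'], ascending=False).iloc[1:6]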
# %% [markdown]
# In this code sample we created a job recommendation system. First, we explored and analyzed the dataset, then we pre-processed the data and created a fake job posting detection model. At the end we used sentence similarities to show the top 5 recommendations - the job descriptions most similar to the chosen one.
# %%
print("[CODE_SAMPLE_COMPLETED_SUCCESSFULLY]")