Introduction

I’ve been looking for a Data Analysis job recently, without success so far. I did get several offers for Python Developer positions, but the work at those companies was quite boring, so I moved on. The fact is that Data Science / Data Mining / Machine Learning jobs in the north of Vietnam are quite rare; there are far more jobs in this field in Ho Chi Minh City.

With my free time I decided to do some analysis of the Vietnamese job market over the last few months. The main purpose of this post is to consolidate my knowledge of exploratory data analysis, so I won’t be too strict about the data source. I wrote some scripts to crawl job data from a single website, Vietnamworks.com. I tried to get data for the first five months of 2016, but it seems Vietnamworks stops showing past postings after a certain period; the earliest data I could get is from April. Not much, but this is for fun anyway.

In my last post, I showed you how to crawl job data from Vietnamworks using BeautifulSoup. The scripts from that post are reused here, but with a slightly wider scope (26 URLs and several additional features).

This post focuses on scraping data for IT jobs, but you can change the base URL to work with any other kind of job you want. Just go to Vietnamworks, use the website’s filter to select the job categories you want to play with, then grab the resulting URL and assign it to the base_url variable in the get_all_urls() method (see the sketch after that method below).

Note: You can view this blog article directly in IPython notebook format here.

Package requirements:

If you do not have the following packages installed on your system, please install them before continuing:

  • Python 2.7
  • requests
  • beautifulsoup4
  • matplotlib
  • seaborn
  • pandas
# Imports & Settings

# Make sure packages in my virtual environment are included
import sys
sys.path.append('/home/hoanvu/anaconda2/envs/ds/lib/python2.7/site-packages/')

import requests
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
from collections import Counter, defaultdict
from datetime import datetime

%matplotlib inline
sns.set_context('poster')

# Result pages to crawl from Vietnamworks: pages 1 .. MAX_PAGE - 1 (26 pages in total)
MAX_PAGE = 27

Next, we will write a method to get all URLs for the job categories of your choice.

  • base_url: the URL of the first page after you configure your filter on the Vietnamworks website and press ‘Search’
  • extended_url_pages: we want to crawl more than just the first page. base_url is the first page; every subsequent page has an extension in the URL, with page-2 for the second page, page-3 for the third, and so on

    We take advantage of this uniform pattern to construct our list of URLs

def get_all_urls():
    """
    Scan through all pages defined by MAX_PAGE and return their URLs
    
    return: list of URLs
    """
    
    base_url = 'http://www.vietnamworks.com/it-hardware-networking-it-software-jobs-i55,35-en'
    extended_url_pages = ['/page-{}'.format(page_number) for page_number in xrange(2, MAX_PAGE)]
    urls = [base_url + extended_page for extended_page in extended_url_pages]
    urls.insert(0, base_url)
    return urls
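
As mentioned in the introduction, nothing here is specific to IT jobs: if you filter for a different category on Vietnamworks, the result pages follow the same /page-N pattern. A minimal sketch of how the URL list could be built for another category (the URL below is a placeholder, not a real link):

# Hypothetical example: crawl a different category by swapping in its first-page URL.
# '<your-filtered-search-url>' is a placeholder, not a real Vietnamworks link.
other_base_url = 'http://www.vietnamworks.com/<your-filtered-search-url>'
other_urls = [other_base_url] + [other_base_url + '/page-{}'.format(n)
                                 for n in xrange(2, MAX_PAGE)]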

Data collection

Features to be collected:

  • Job title
  • Technical skills
  • Work location
  • Position
  • Company name
  • Job post date

The method get_jobs_in_url() gets the details of every job on a single URL. See my previous post for an explanation of the code.

def get_jobs_in_url(url):
    """
    Take a single URL and return a pandas DataFrame containing details about all jobs in that URL.
    
    return: pandas.DataFrame
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')
    
    jobs_table = soup.find('table', {'class': 'link-list'})
    tr_tags = jobs_table.find_all('tr', {'class': 'job-post'})
    
    job_title = []
    job_company = []
    job_location = []
    job_position = []
    job_skills = []
    job_date = []
    
    for job in tr_tags:
        job_title.append(job.find('a', {'class': 'job-title'}).contents[0])
        job_company.append(job.find('span', 'name').contents[0])
        job_location.append(job.find('p', {'class': 'job-info'}).contents[1].find('span').contents[0])
        job_position.append(job.find('p', {'class': 'job-info'}).contents[1].find_all('span')[1].contents[0])
                            
        # Technical skills (some posts have no skill tags, so default to an empty list)
        skill_list_data = job.find('div', {'class': 'skills'})
        skill_list = skill_list_data.find_all('em', {'class': 'text-clip'}) if skill_list_data else []
        job_skills.append([skill.contents[0] for skill in skill_list])
        
        # Job post date ('Today' is normalised to the current date)
        d = job.find('span', {'class': 'views'}).find('span').contents[0].split(' ')[1]
        if d == 'Today':
            d = datetime.now().strftime("%d/%m/%Y")
        job_date.append(d)
    
        # Job salary
        
    return pd.DataFrame({
            'job_title': job_title,
            'company': job_company,
            'location': job_location,
            'position': job_position,
            'skills': job_skills,
            'post_date': job_date
        })
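
Before crawling all 26 pages, it is worth sanity-checking the parser on a single page. A quick check could look like this (just a sketch, using the functions defined above):

# Parse only the first result page and eyeball a few columns
sample = get_jobs_in_url(get_all_urls()[0])
print sample.shape
print sample[['job_title', 'location', 'post_date']].head()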

The method get_all_jobs() simply scans through all URLs from get_all_urls() and collects the details of every job listed on each page:

def get_all_jobs():
    """
    Scans all URLs and crawls all jobs from all pages
    
    return: pandas.DataFrame containing every job we can find
    """
    urls = get_all_urls()
    data = pd.DataFrame()
    for url in urls:
        data = data.append(get_jobs_in_url(url), ignore_index=True)
        
    return data
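
As a side note, appending to a DataFrame inside a loop copies the data on every iteration. With only 26 pages this is perfectly fine, but an equivalent sketch that builds a list of per-page frames and concatenates them once could look like this:

def get_all_jobs_concat():
    """Same result as get_all_jobs(), built with a single pd.concat call."""
    frames = [get_jobs_in_url(url) for url in get_all_urls()]
    return pd.concat(frames, ignore_index=True)

Either version produces the same DataFrame; we will stick with the original get_all_jobs() below.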

jobs = get_all_jobs()
# Convert the job post date to datetime type
jobs['post_date'] = pd.to_datetime(jobs.post_date, format="%d/%m/%Y")
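
Crawling all the pages takes a little while, so it can be convenient to cache the crawled DataFrame on disk and reload it in later sessions instead of hitting Vietnamworks again (pandas’ to_pickle/read_pickle use the same pickle format as the pickle module imported above). A small sketch, with a hypothetical file name:

# Save the crawled data once ('vietnamworks_jobs.pkl' is just a placeholder name)...
jobs.to_pickle('vietnamworks_jobs.pkl')

# ...and load it back in a later session without re-crawling
jobs = pd.read_pickle('vietnamworks_jobs.pkl')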

Let’s take a quick look at our data:

jobs.head()
company job_title location position post_date skills
0 Công Ty Cổ Phần Sabre Việt Nam Chuyên Viên IT Ha Noi Experienced (non-manager) 2016-06-02 [Máy Tính, Mạng, Triển Khai Tổng Đài IP]
1 Công Ty TNHH Vận Hành Jmango Việt Nam Java Developer (quantity: 3) Ha Noi Experienced (non-manager) 2016-05-26 [SQL Server-mysql, HTML - CSS - Javascript, Ja...
2 CSC Vietnam SAP Consultants (real Estate, Fi, Co, Ps, Mm, ... Ho Chi Minh Experienced (non-manager) 2016-05-25 [SAP & ERP, SAP Consultant ( FI / CO/ MM ) - R...
3 CSC Vietnam Senior Software Engineer (with Japanese Language) Ho Chi Minh Experienced (non-manager) 2016-05-18 [Interpreter - Translator, Bridge Engineer ( J...
4 CSC Vietnam Java Senior Software Engineer Ho Chi Minh Experienced (non-manager) 2016-05-18 [J2EE Architecture, Java Developer - IT Softwa...


Nice. The data is not very clean, but it shows that the code so far is working. Now let’s jump to the most interesting part and see what the data says.

Exploratory Data Analysis

At first I planned to clean this dataset a bit more, but then I realized it would be better to keep it as it is and transform each column based on the information I need. Feel free to take the dataset and tweak it any way you want.

We will try to answer a few questions as we go.

First, what is the general state of the Vietnamese job market?

How many days of data did we collect?

print jobs.post_date.max() - jobs.post_date.min()
37 days 00:00:00

How many jobs were posted during this period?

print "Total jobs: {}".format(len(jobs))
Total jobs: 1279

1,279 jobs were posted in only 37 days; I find that a striking number, and it suggests the Vietnamese job market is very busy. If we collected data from other sites as well, the number of posted jobs would likely be much larger.

Which cities have the highest recruitment demand?

From the graph below, it’s obvious that Ho Chi Minh and Ha Noi are the two cities with huge demand for IT talent, which makes sense. Ho Chi Minh has by far the highest demand.

Da Nang is also a hot city for job hunters. Some jobs even list their work location as International, which is interesting!

all_locations = []
for location in jobs.location:
    if ',' in location:
        all_locations.extend(location.split(', '))
    else:
        all_locations.append(location)

# Drop cities with 5 or fewer job posts, just to keep the graph readable
most_cities = [key for key, value in Counter(all_locations).iteritems() if value > 5]
new_cities = [city for city in all_locations if city in most_cities]

sns.countplot(y=new_cities)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb35e016ad0>

[Figure: number of job posts per city]

What are the percentages? Note that locations are counted per city here (a post listing several cities is counted once for each), so the total below is larger than the number of job posts.

from __future__ import division

location_counter = Counter(all_locations)
total_jobs = sum(location_counter.values())
jobs_in_hcm = location_counter['Ho Chi Minh']
jobs_in_hn = location_counter['Ha Noi']
jobs_in_dn = location_counter['Da Nang']
jobs_in_int = location_counter['International']

print "Total jobs:                     {}".format(total_jobs)
print "Jobs required by Ho Chi Minh:   {:.3f}%".format((jobs_in_hcm / total_jobs) * 100.0 )
print "Jobs required by Ha Noi:        {:.3f}%".format((jobs_in_hn / total_jobs) * 100.0 ) 
print "Jobs required by Da Nang:       {:.3f}%".format((jobs_in_dn / total_jobs) * 100.0 )
print "Jobs required by International: {:.3f}%".format((jobs_in_int / total_jobs) * 100.0) 
Total jobs:                     1477
Jobs required by Ho Chi Minh:   52.200%
Jobs required by Ha Noi:        35.003%
Jobs required by Da Nang:       3.927%
Jobs required by International: 2.370%


Which positions are recruiters looking for?

sns.countplot(x='position', data=jobs)
<matplotlib.axes._subplots.AxesSubplot at 0x7fb35e016290>

[Figure: number of job posts per position]

It’s obvious that the Specialist position is the most sought-after.

Now, which skills are expected by companies?

# Need to improve this code

skill_counter = defaultdict(int)
for row in jobs.skills:
    for skill in row:
        skill = skill.lower()
        if 'c++' in skill:
            skill_counter['C++'] += 1
        if 'java' in skill or 'j2ee' in skill or 'servlet' in skill or 'jsp' in skill or 'hibernate' in skill:
            skill_counter['Java'] += 1
        if 'ios' in skill or 'objective-c' in skill or 'objective c' in skill:
            skill_counter['iOS'] += 1
        if 'python' in skill or 'django' in skill or 'flask' in skill:
            skill_counter['Python'] += 1
        if 'html' in skill or 'css' in skill or 'css3' in skill:
            skill_counter['HTML/CSS'] += 1
        if 'android' in skill:
            skill_counter['Android'] += 1
        if 'oracle' in skill or 'sql' in skill or 'sql server' in skill or 'mysql' in skill or \
            'database' in skill or 'postgres' in skill:
            skill_counter['Database'] += 1
        if 'linux' in skill or 'redhat' in skill or 'centos' in skill or 'ubuntu' in skill:
            skill_counter['Linux'] += 1
        if 'cisco' in skill or 'ccna' in skill or 'ccnp' in skill or 'ccie' in skill or \
            'network' in skill or 'routing' in skill or 'switching' in skill:
            skill_counter['Network'] += 1
        if 'c#' in skill or '.net' in skill:
            skill_counter['C#/.Net'] += 1
        if 'javascript' in skill:
            skill_counter['Javascript'] += 1
        if 'php' in skill:
            skill_counter['PHP'] += 1
        if 'scrum' in skill:
            skill_counter['Agile Scrum'] += 1
        if 'brse' in skill or 'bridge' in skill:
            skill_counter['BrSE'] += 1

skills = pd.DataFrame(dict(skill_counter), index=range(1))
skills.iloc[0].sort_values(ascending=False).plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x7fb35bb182d0>

[Figure: number of job posts mentioning each skill group]

This might not be 100% accurate, but still, it confirms that Java has been the dominant programming language for a very long time.
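
The “# Need to improve this code” comment above is a fair point: the long chain of if-statements is repetitive. One possible cleanup, keeping essentially the same keyword groups but driving the matching from a dict (just a sketch, relying on the jobs DataFrame and the imports from earlier):

# Map each skill group to the keywords that count towards it
SKILL_KEYWORDS = {
    'C++': ['c++'],
    'Java': ['java', 'j2ee', 'servlet', 'jsp', 'hibernate'],
    'iOS': ['ios', 'objective-c', 'objective c'],
    'Python': ['python', 'django', 'flask'],
    'HTML/CSS': ['html', 'css'],
    'Android': ['android'],
    'Database': ['oracle', 'sql', 'database', 'postgres'],
    'Linux': ['linux', 'redhat', 'centos', 'ubuntu'],
    'Network': ['cisco', 'ccna', 'ccnp', 'ccie', 'network', 'routing', 'switching'],
    'C#/.Net': ['c#', '.net'],
    'Javascript': ['javascript'],
    'PHP': ['php'],
    'Agile Scrum': ['scrum'],
    'BrSE': ['brse', 'bridge'],
}

skill_counter = defaultdict(int)
for row in jobs.skills:
    for skill in row:
        skill = skill.lower()
        # A skill string can still count towards several groups, as in the original loop
        for group, keywords in SKILL_KEYWORDS.iteritems():
            if any(keyword in skill for keyword in keywords):
                skill_counter[group] += 1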

Which companies are hiring people to work outside Vietnam?

jobs[jobs.location.str.contains('International')][['company', 'job_title']]
company job_title
9 Gmo-z.com Vietnam Lab Center Kỹ Sư Phần Mềm Biết Tiếng Nhật (bse/ Smartphon...
19 Saritasa Senior iOS Mobile Developer - Competitive Salary
76 Harvey Nash Viet Nam Senior Sharepoint Developer (go Onsite Singapore)
111 Robert Bosch Engineering Vietnam *urgent* 3 Senior Liferay Engineers (4 - 6 Years)
170 Công Ty TNHH Thương Mại Và Dịch Vụ Tri Thức Mới [tuyển Gấp] 8 Lập Trình Viên .NET, Php, Java C...
178 Công Ty TNHH Phần Mềm FPT Kỹ Sư Cầu Nối (BrSE) Làm Việc Dài Hạn Tại Nhật...
211 Neos Corporation Web Developer - Làm Việc Tại Nhật Bản
268 Công Ty TNHH Phần Mềm FPT (FPT Software) 10 Testers Tiếng Nhật N2 Có Cơ Hội Onsite Mỹ
305 Neos Corporation 05 Kỹ Sư iOS ( High Salary )
306 Neos Corporation 05 Kỹ Sư Android ( High Salary )
341 CSC Vietnam Scrum Master for iOS (onsite Dubai)
460 Synova Solutions Urgent - PHP Developer (drupal/wordpress/symfo...
620 Công Ty TNHH Công Nghệ Phần Mềm Kaopiz Onsite Engineer
628 Tổng Công Ty Cổ Phần Bảo Hiểm Sài Gòn - Hà Nội... Chuyên Viên Công Nghệ Thông Tin ( Làm Việc Tại...
664 Công Ty Phần Mềm Của Nhật Bản Kỹ Sư Hệ Thống Biết Tiếng Nhật (BrSE)
698 Neos Corporation 03 Bridge System Engineer ( $1000 ~ $2000)
704 Imaginato ($700 ~$1500) E-commerce PHP Engineer
711 Finexus Sdn Bhd *** Hot :15 Lập Trình Viên Java Làm Việc Tại M...
724 Sutherland Global Services Malaysia Technical Support
728 株式会社プロフェースシステムズ System Engineer To Japan ($2000 ~ $5000)
847 Marketjs Game Developer ( Mobile / Web / Programmer / H...
862 Công Ty TNHH Thương Mại Toàn Cầu [$2100 - $4000] BSE . Cầu Nối Kỹ Sư Việt - Nhật
864 Lucky Ruby Casino & Resort System Programmer
885 COMIT Corporation Kỹ Sư Tối Ưu Hóa Mạng Vô Tuyến 3G (Experienced...
895 Synova Solutions Urgent - Senior Frontend Developer (location: ...
949 Iprice Group Senior Software Engineer
962 Paxcreation Senior PHP Developer (exp 3+ Years) - Up To 10...
993 The Database Consultants, LLC SharePoint Developer/admin To Work in Hawaii, USA
995 ZTE Corporation Oversea Project Manager ( Working in Peru )- U...
1023 Công Ty Cổ Phần Phát Triển Nguồn Nhân Lực Quốc... Kỹ Sư Công Nghệ Thông Tin (Làm Việc Tại Nhật Bản)
1062 Saritasa Senior Angularjs Javascript (js) Developer - C...
1106 Onea JAVA Developer
1219 Finexus Sdn Bhd 10 Lập Trình Viên Java (sắp Tốt Nghiệp - 3 Năm...
1229 Ars Nova Viet Nam Company Limited Kỹ Sư Phần Mềm Được Đào Tạo Tiếng Nhật và Sẽ L...
1254 Rivercrane Viet Nam PHP Developer (High Salary!)


Have fun!

