A Personal Corner

Introduction

I’ve been looking for a Data Analysis job recently and failed to do so. Actually I got several offers for Python Developer position but the work in these companies are quite boring so I left. The fact is Data Science / Data Mining / Machine Learning jobs in the north of Vietnam are quite rare. There are a lot more jobs in this field in Ho Chi Minh city.

With my free time I decided to do some analysis about job market for the last few months in Vietnam. This is written with the main purpose to consolidate my knowledge about exploratory data analysis, so I will not take it very seriously about data source. I wrote some scripts to crawl job data from only one website Vietnamworks.com for the last few months. I tried to get data for the first 5 months of 2016, but it seems Vietnamworks doesn’t show their past data after a specific period. The best I could get is from April. Not much, but this is for fun anyway.

In my last post, I showed you how to crawl job data from Vietnamworks using BeautifulSoup. Scripts used in that post will be reused here, but the scope is a little bit wider (from 26 URLs and several other features).

This post will focus on scaping data for IT jobs, but you can change the base URL to work with any other kind of job you want. Just go to Vietnamworks, use the website’s filter to select the job categories you want to play with, and get the base URL and assign it to the base_url variable in get_all_urls() method.

Note: You can view this blog article directly in IPython notebook format here.

Package requirements:

If you do not have those packages installed on your system, please do so before continue:

Python 2.7
beautifulsoup4
matplotlib
seaborn
pandas

# Imports & Settings

# Make sure packages in my virtual environment are included
import sys
sys.path.append('/home/hoanvu/anaconda2/envs/ds/lib/python2.7/site-packages/')

import requests
import pickle
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
from collections import Counter, defaultdict
from datetime import datetime

%pylab inline
sns.set_context('poster')

# No of URLs to crawl from Vietnamworks
MAX_PAGE = 27

Populating the interactive namespace from numpy and matplotlib


WARNING: pylab import has clobbered these variables: ['datetime']
`%matplotlib` prevents importing * from pylab and numpy

Next, we will write a method to get all URLs for the job categories of your choice.

base_url: this is the URL of the first page after you configure your filter on Vietnamworks website and press ‘Search’
extended_url_pages: we want to crawl more than just the first page. base_url is the first page, all subsequence pages will have the extension in the URL, with page-2 for the second page, page-3 for the third page and so on

We we get advantage of this uniform to construct our list of URLs

def get_all_urls():
    """
    Scan through all URLs defined by MAX_PAGE and return the URL
    
    return: list of URLs
    """
    
    base_url = 'http://www.vietnamworks.com/it-hardware-networking-it-software-jobs-i55,35-en'
    extended_url_pages = ['/page-{}'.format(page_number) for page_number in xrange(2, MAX_PAGE)]
    urls = [base_url + extended_page for extended_page in extended_url_pages]
    urls.insert(0, base_url)
    return urls

Data collection

Features that need to be collect:

Job title
Technical skills
Work location
Position
Company name
Job post date

This method get_jobs_in_url() get details for all job in a single URL. You can see my previous post for explanation about the code.

def get_jobs_in_url(url):
    """
    Take a single URL and return a pandas DataFrame contains details about all jobs in that URL.
    
    return: pandas.DataFrame
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.content, 'html.parser')
    
    jobs_table = soup.find('table', {'class': 'link-list'})
    tr_tags = jobs_table.find_all('tr', {'class': 'job-post'})
    
    job_title = []
    job_company = []
    job_location = []
    job_position = []
    job_skills = []
    job_date = []
    
    for job in tr_tags:
        job_title.append(job.find('a', {'class': 'job-title'}).contents[0])
        job_company.append(job.find('span', 'name').contents[0])
        job_location.append(job.find('p', {'class': 'job-info'}).contents[1].find('span').contents[0])
        job_position.append(job.find('p', {'class': 'job-info'}).contents[1].find_all('span')[1].contents[0])
                            
        skill_list_data = job.find('div', {'class': 'skills'})
        if skill_list_data:
            skill_list = skill_list_data.find_all('em', {'class': 'text-clip'})
            
        job_skills.append([skill.contents[0] for skill in skill_list])
        
        # Job post date
        d = job.find('span', {'class': 'views'}).find('span').contents[0].split(' ')[1]
        if d == 'Today':
            d = datetime.datetime.now().strftime("%d/%m/%Y")
        job_date.append(d)
    
        # Job salary
        
    return pd.DataFrame({
            'job_title': job_title,
            'company': job_company,
            'location': job_location,
            'position': job_position,
            'skills': job_skills,
            'post_date': job_date
        })

This method get_all_jobs() just scan through all URLs from get_all_urls() and get details for all jobs listed in each and every page:

def get_all_jobs():
    """
    Scans all URLs and crawls all jobs from all pages
    
    return: pandas.DataFrame contains every possible job that we can find
    """
    urls = get_all_urls()
    data = pd.DataFrame()
    for url in urls:
        data = data.append(get_jobs_in_url(url), ignore_index=True)
        
    return data

jobs = get_all_jobs()

# Convert the job post date to datetime type
jobs['post_date'] = pd.to_datetime(jobs.post_date, format="%d/%m/%Y")

Let’s take a quick look at our data:

jobs.head()

	company	job_title	location	position	post_date	skills
0	Công Ty Cổ Phần Sabre Việt Nam	Chuyên Viên IT	Ha Noi	Experienced (non-manager)	2016-06-02	[Máy Tính, Mạng, Triển Khai Tổng Đài IP]
1	Công Ty TNHH Vận Hành Jmango Việt Nam	Java Developer (quantity: 3)	Ha Noi	Experienced (non-manager)	2016-05-26	[SQL Server-mysql, HTML - CSS - Javascript, Ja...
2	CSC Vietnam	SAP Consultants (real Estate, Fi, Co, Ps, Mm, ...	Ho Chi Minh	Experienced (non-manager)	2016-05-25	[SAP & ERP, SAP Consultant ( FI / CO/ MM ) - R...
3	CSC Vietnam	Senior Software Engineer (with Japanese Language)	Ho Chi Minh	Experienced (non-manager)	2016-05-18	[Interpreter - Translator, Bridge Engineer ( J...
4	CSC Vietnam	Java Senior Software Engineer	Ho Chi Minh	Experienced (non-manager)	2016-05-18	[J2EE Architecture, Java Developer - IT Softwa...

Nice, seems that our data is not very clean, but it says that our code until now is working. Now, let’s jump to the most interesting part and see what our data says?

Exploratory Data Analysis

At first I planned to clean this dataset a little bit more, but then I realize it would be better if I keep it this way and transform each column based on the information I need. If you read this, feel free to take the dataset and tweak it the way you want.

We will try to answer few questions as we go.

First, what is the general status of Vietnam job market?

How many days did we collect the data?

print jobs.post_date.max() - jobs.post_date.min()

37 days 00:00:00

How many jobs were posted during this period?

print "Total jobs: {}".format(len(jobs))

Total jobs: 1279

1277 jobs were posted in only 37 days, I think that this is very interesting number to show that the Vietnam job market is very bustle. If we collect data from other sites, there maybe even much bigger number of jobs posted.

Which city have the most demand for job recruitment?

From the graph below, it’s obvious that Ho Chi Minh and Ha Noi are two cities which have huge demand for human resources in IT, which makes sense! Ho Chi Minh is by far the most demanding city.

Da Nang is also a very hot city for job hunter. There are some jobs that have work location as International, this is very interesting!

all_locations = []
for location in jobs.location:
    if ',' in location:
        all_locations.extend(location.split(', '))
    else:
        all_locations.append(location)

# Remove cities that have less than 5 jobs, just to make grap clearer
most_cities = [key for key, value in Counter(all_locations).iteritems() if value > 5]
new_cities = [city for city in all_locations if city in most_cities]

sns.countplot(y=new_cities)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb35e016ad0>

png

What’s the percentage?

from __future__ import division

location_counter = Counter(all_locations)
total_jobs = sum(location_counter.values())
jobs_in_hcm = location_counter['Ho Chi Minh']
jobs_in_hn = location_counter['Ha Noi']
jobs_in_dn = location_counter['Da Nang']
jobs_in_int = location_counter['International']

print "Total jobs:                     {}".format(total_jobs)
print "Jobs required by Ho Chi Minh:   {:.3f}%".format((jobs_in_hcm / total_jobs) * 100.0 )
print "Jobs required by Ha Noi:        {:.3f}%".format((jobs_in_hn / total_jobs) * 100.0 ) 
print "Jobs required by Da Nang:       {:.3f}%".format((jobs_in_dn / total_jobs) * 100.0 )
print "Jobs required by International: {:.3f}%".format((jobs_in_int / total_jobs) * 100.0) 

Total jobs:                     1477
Jobs required by Ho Chi Minh:   52.200%
Jobs required by Ha Noi:        35.003%
Jobs required by Da Nang:       3.927%
Jobs required by International: 2.370%

Which positions does recruiter looking for?

sns.countplot(x='position', data=jobs)

<matplotlib.axes._subplots.AxesSubplot at 0x7fb35e016290>

png

It’s obvious that Specialist position is the hottest job.

Now, which skills are expected by companies?

# Need to improve this code

skill_counter = defaultdict(int)
for row in jobs.skills:
    for skill in row:
        skill = skill.lower()
        if 'c++' in skill:
            skill_counter['C++'] += 1
        if 'java' in skill or 'j2ee' in skill or 'servlet' in skill or 'jsp' in skill or 'hibernate' in skill:
            skill_counter['Java'] += 1
        if 'ios' in skill or 'objective-c' in skill or 'objective c' in skill:
            skill_counter['iOS'] += 1
        if 'python' in skill or 'django' in skill or 'flask' in skill:
            skill_counter['Python'] += 1
        if 'html' in skill or 'css' in skill or 'css3' in skill:
            skill_counter['HTML/CSS'] += 1
        if 'android' in skill:
            skill_counter['Android'] += 1
        if 'oracle' in skill or 'sql' in skill or 'sql server' in skill or 'mysql' in skill or \
            'database' in skill or 'postgres' in skill:
            skill_counter['Database'] += 1
        if 'linux' in skill or 'redhat' in skill or 'centos' in skill or 'ubuntu' in skill:
            skill_counter['Linux'] += 1
        if 'cisco' in skill or 'ccna' in skill or 'ccnp' in skill or 'ccie' in skill or \
            'network' in skill or 'routing' in skill or 'switching' in skill:
            skill_counter['Network'] += 1
        if 'c#' in skill or '.net' in skill:
            skill_counter['C#/.Net'] += 1
        if 'javascript' in skill:
            skill_counter['Javascript'] += 1
        if 'php' in skill:
            skill_counter['PHP'] += 1
        if 'scrum' in skill:
            skill_counter['Agile Scrum'] += 1
        if 'brse' in skill or 'bridge' in skill:
            skill_counter['BrSE'] += 1

skills = pd.DataFrame(dict(skill_counter), index=range(1))
skills.iloc[0].sort_values(ascending=False).plot(kind='barh')
# sns.barplot(data=pd.DataFrame(dict(skill_counter), index=range(1)), orient='h')
# sns.barplot?

<matplotlib.axes._subplots.AxesSubplot at 0x7fb35bb182d0>

png

This might not be 100% percent accurate, but skill, we know that Java is always the dominated programming language for a very long time.

Which companies are hiring people to work outside Vietnam?

jobs[jobs.location.str.contains('International')][['company', 'job_title']]

	company	job_title
9	Gmo-z.com Vietnam Lab Center	Kỹ Sư Phần Mềm Biết Tiếng Nhật (bse/ Smartphon...
19	Saritasa	Senior iOS Mobile Developer - Competitive Salary
76	Harvey Nash Viet Nam	Senior Sharepoint Developer (go Onsite Singapore)
111	Robert Bosch Engineering Vietnam	urgent 3 Senior Liferay Engineers (4 - 6 Years)
170	Công Ty TNHH Thương Mại Và Dịch Vụ Tri Thức Mới	[tuyển Gấp] 8 Lập Trình Viên .NET, Php, Java C...
178	Công Ty TNHH Phần Mềm FPT	Kỹ Sư Cầu Nối (BrSE) Làm Việc Dài Hạn Tại Nhật...
211	Neos Corporation	Web Developer - Làm Việc Tại Nhật Bản
268	Công Ty TNHH Phần Mềm FPT (FPT Software)	10 Testers Tiếng Nhật N2 Có Cơ Hội Onsite Mỹ
305	Neos Corporation	05 Kỹ Sư iOS ( High Salary )
306	Neos Corporation	05 Kỹ Sư Android ( High Salary )
341	CSC Vietnam	Scrum Master for iOS (onsite Dubai)
460	Synova Solutions	Urgent - PHP Developer (drupal/wordpress/symfo...
620	Công Ty TNHH Công Nghệ Phần Mềm Kaopiz	Onsite Engineer
628	Tổng Công Ty Cổ Phần Bảo Hiểm Sài Gòn - Hà Nội...	Chuyên Viên Công Nghệ Thông Tin ( Làm Việc Tại...
664	Công Ty Phần Mềm Của Nhật Bản	Kỹ Sư Hệ Thống Biết Tiếng Nhật (BrSE)
698	Neos Corporation	03 Bridge System Engineer ( $1000 ~ $2000)
704	Imaginato	($700 ~$1500) E-commerce PHP Engineer
711	Finexus Sdn Bhd	*** Hot :15 Lập Trình Viên Java Làm Việc Tại M...
724	Sutherland Global Services Malaysia	Technical Support
728	株式会社プロフェースシステムズ	System Engineer To Japan ($2000 ~ $5000)
847	Marketjs	Game Developer ( Mobile / Web / Programmer / H...
862	Công Ty TNHH Thương Mại Toàn Cầu	[$2100 - $4000] BSE . Cầu Nối Kỹ Sư Việt - Nhật
864	Lucky Ruby Casino & Resort	System Programmer
885	COMIT Corporation	Kỹ Sư Tối Ưu Hóa Mạng Vô Tuyến 3G (Experienced...
895	Synova Solutions	Urgent - Senior Frontend Developer (location: ...
949	Iprice Group	Senior Software Engineer
962	Paxcreation	Senior PHP Developer (exp 3+ Years) - Up To 10...
993	The Database Consultants, LLC	SharePoint Developer/admin To Work in Hawaii, USA
995	ZTE Corporation	Oversea Project Manager ( Working in Peru )- U...
1023	Công Ty Cổ Phần Phát Triển Nguồn Nhân Lực Quốc...	Kỹ Sư Công Nghệ Thông Tin (Làm Việc Tại Nhật Bản)
1062	Saritasa	Senior Angularjs Javascript (js) Developer - C...
1106	Onea	JAVA Developer
1219	Finexus Sdn Bhd	10 Lập Trình Viên Java (sắp Tốt Nghiệp - 3 Năm...
1229	Ars Nova Viet Nam Company Limited	Kỹ Sư Phần Mềm Được Đào Tạo Tiếng Nhật và Sẽ L...
1254	Rivercrane Viet Nam	PHP Developer (High Salary!)

Have fun!

« Homepage

Recent Posts

Posts by Category

Useful Links

Vietnam IT Jobs Analysis
Jun 2, 2016

Introduction

Package requirements:

Data collection

Let’s take a quick look at our data:

Exploratory Data Analysis

First, what is the general status of Vietnam job market?

Which city have the most demand for job recruitment?

What’s the percentage?

Which positions does recruiter looking for?

Now, which skills are expected by companies?

Which companies are hiring people to work outside Vietnam?

Have fun!

Recent Posts

Posts by Category

Useful Links

Vietnam IT Jobs Analysis Jun 2, 2016

Introduction

Package requirements:

Data collection

Let’s take a quick look at our data:

Exploratory Data Analysis

First, what is the general status of Vietnam job market?

Which city have the most demand for job recruitment?

What’s the percentage?

Which positions does recruiter looking for?

Now, which skills are expected by companies?

Which companies are hiring people to work outside Vietnam?

Have fun!

Vietnam IT Jobs Analysis
Jun 2, 2016