Chapter 2 Data sources
2.1 Data Description
The data used for this project were downloaded from several websites that publish census data related to the telecommuting population. The sources vary depending on which facet of the problem we investigate.
One of the websites we used is the U.S. Bureau of Labor Statistics (BLS), which is known as one of the largest government-published databases. From the page “Employed persons working on main job at home and at their workplace and time spent working at each location by occupation” we obtained the annual WFH population by sector before COVID-19 (2013-2019). From the page “Effects of the coronavirus COVID-19 pandemic” we obtained the WFH data since May 2020, which measure the effect of COVID-19 on the labor market from May 2020 to March 2021. The data downloaded from the former page consist of 20 tables containing 13 variables that describe the total employed population and the WFH-only worker population within each occupation sector over the years. The data downloaded from the latter page consist of 11 tables containing demographic (Race, Gender, Ages, …), occupational, industrial, and other characteristics of the total and WFH populations.
We accessed employment payroll data from “Employment, Hours, and Earnings from the Current Employment Statistics survey (National)”, published by BLS, through the Employment Situation Table. The source provides monthly average working hours and payrolls for employees in different sectors from 2011 to 2021. All sectors belong to the Total private sector, which is divided into two main sectors: Goods-producing and Private service-providing. Goods-producing includes Mining and logging, Construction, and Manufacturing, the last of which consists of Durable goods and Nondurable goods. Private service-providing includes Trade, transportation, and utilities, Information, Financial activities, Professional and business services, Education and health services, Leisure and hospitality, and Other services. Under Trade, transportation, and utilities, there are the Wholesale trade, Retail trade, Transportation and warehousing, and Utilities sectors.
We also drew on U.S. BLS Beta Labs, which provides time-series data on employment productivity. We wanted to show the trend of industry-level productivity and its correlation with the WFH share of the labor force, so we selected quarterly data from 2018 to 2020 by sector. There were six sectors in total: Business, Non-farm business, Manufacturing, Durable goods, Non-durable goods, and Non-financial corporations. To compare productivity changes, we took three features: Productivity, Working hours, and Unit labor costs. For each sector and feature we chose the same unit for comparison: output per hour for labor productivity, average weekly hours for working hours, and unit labor costs. The source provided 18 raw data tables in total, which we merge and combine to produce several plots.
2.2 Issues/Problems with Data
Though the data provide suitable information for our topic, they have some limitations. One issue is that the COVID-concurrent data are only available as monthly summaries starting in May 2020, while the pre-COVID datasets only contain yearly summaries, which end in 2019. Since we do not have data for the first several months of 2020, we cannot generate a yearly summary from the COVID-concurrent table. This leaves a data gap from January through April 2020. We might need to give an estimate or visualize the trend when dealing with this problem. Another issue is that no data directly link employment payrolls with WFH. We will need to use the first two sources in the hope of gaining some insight into the change in the number of WFH workers and its possible effect on payroll amounts.
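The extent of the gap can be checked mechanically from the coverage described above. The sketch below is only an illustration: the helper names month_range and uncovered_months are hypothetical, and it assumes that a yearly summary covers all twelve months of its year. With yearly summaries for 2013 to 2019 and monthly summaries from May 2020 to March 2021, it lists January through April 2020 as uncovered, since May is the first month with monthly data.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def uncovered_months(yearly_years, monthly_start, monthly_end):
    """Return the months covered by neither the yearly nor the monthly data.

    Assumption: a yearly summary covers every month of its year.
    """
    first = (min(yearly_years), 1)
    covered = {(y, m) for y in yearly_years for m in range(1, 13)}
    covered |= set(month_range(monthly_start, monthly_end))
    return [ym for ym in month_range(first, monthly_end) if ym not in covered]


# Coverage as described in the text: yearly 2013-2019, monthly May 2020 to March 2021.
gap = uncovered_months(range(2013, 2020), (2020, 5), (2021, 3))
```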
The data sets extracted from U.S. BLS Beta Labs are quarterly or weekly averages. Since data for finer time periods were not published, we could not depict more time-sensitive changes. We were also unable to get the data for the first quarter of 2021. The work-from-home data contain one missing value, but it did not influence the initial analysis. Missing values are described in “04-missing part”. Another issue was that the sources define industrial sectors differently, so the sectors for productivity and payrolls could not be made uniform. To address this, we matched some of the sectors to estimate our data.
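One simple way to express such a sector matching is an explicit crosswalk from one source's sector names to the other's. The sketch below is an assumption about how the matching could look, not the mapping actually used: the CROSSWALK entries pair only sectors whose names nearly coincide in the two sources, and the function name match_sectors is hypothetical.

```python
# Hypothetical crosswalk: payroll sector name -> productivity sector name.
# Only sectors with near-identical names in both sources are paired here;
# the matching actually used in the project may differ.
CROSSWALK = {
    "Manufacturing": "Manufacturing",
    "Durable goods": "Durable goods",
    "Nondurable goods": "Non-durable goods",
}


def match_sectors(payroll_sectors, crosswalk=CROSSWALK):
    """Split payroll sectors into those with a productivity match and those without."""
    matched, unmatched = {}, []
    for name in payroll_sectors:
        if name in crosswalk:
            matched[name] = crosswalk[name]
        else:
            unmatched.append(name)
    return matched, unmatched


matched, unmatched = match_sectors(
    ["Manufacturing", "Nondurable goods", "Information"]
)
```

Returning the unmatched sectors separately makes it explicit which payroll sectors have no productivity counterpart and are left out of any comparison.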