Data conversion – the first step towards data processing
Convert every string value to an integer in the range 0 to n, where n is the number of possible values of the attribute.
Age: continuous.
Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
Fnlwgt: continuous.
Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
Sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
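As a rough illustration of the 0-to-n encoding, here is a minimal sketch for two of the attributes above. The helper name build_codes is made up for this example, and it assumes that a missing value is written as '?' and gets the code 0, so the listed values start at 1.

```python
# Two of the attribute value lists from above.
SEX = ['Female', 'Male']
RACE = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']


def build_codes(values):
    """Map each category string to an integer code.

    Assumption: 0 is reserved for the missing-value marker '?',
    so the listed values are numbered from 1 to n.
    """
    codes = {'?': 0}
    for index, value in enumerate(values, start=1):
        codes[value] = index
    return codes


sex_codes = build_codes(SEX)
race_codes = build_codes(RACE)

print(sex_codes['Male'])    # 2
print(race_codes['Black'])  # 5
```

A dictionary lookup like this works on whole field values, so one category name can never be mistaken for part of another.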
I used a Python program to do the conversion. While writing the code, especially the arrays, I found that adding the quotation marks by hand was a waste of time, so I wrote a small program to add them for me.
The conversion program is:
import time

# start timing
t1 = time.time()

# define arrays for conversion; index 0 ('?') stands for a missing value
workclass = ['?', 'Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']

education = ['?', 'Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']

marital_status = ['?', 'Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse']

occupation = ['?', 'Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces']

relationship = ['?', 'Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried']

race = ['?', 'White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']

sex = ['?', 'Female', 'Male']

native_country = ['?', 'United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands']

isover50K = ['?', '>50K', '<=50K']

# define a 2-dimensional array
items = [workclass, education, marital_status, occupation, relationship, race, sex, native_country, isover50K]

# open files; the with statement closes them automatically
with open('../resource/adult.data', 'r') as filereader, \
        open('../resource/converted_data.data', 'w') as filewriter:

    # read the file line by line
    for eachline in filereader:

        # iterate over the arrays
        for item in items:

            # replace each string with its index in the array
            # (replace() matches substrings, so the order of the arrays matters:
            # 'Other-service' in occupation has to be replaced before 'Other' in race)
            for count, element in enumerate(item):
                eachline = eachline.replace(element, str(count))

        # write the converted line to the output file
        filewriter.write(eachline)

# end timing
t2 = time.time()

print('done')
print(str(t2 - t1))
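The helper for adding quotation marks might look something like the sketch below. This is not the original helper, only a guess at its shape: the function name quote_values is made up, and it assumes the input is one comma-separated value list ending in a full stop, as in the attribute description above, with '?' put in front for missing values.

```python
def quote_values(raw):
    """Turn a line such as 'Female, Male.' into Python list source text.

    Assumptions: values are separated by commas, the line may end with
    a full stop, and '?' is prepended as the missing-value marker.
    """
    text = raw.strip()
    if text.endswith('.'):
        text = text[:-1]
    values = [value.strip() for value in text.split(',') if value.strip()]
    # repr() of a list of strings adds the quotation marks for us.
    return repr(['?'] + values)


print(quote_values('Female, Male.'))
# ['?', 'Female', 'Male']
```

The printed text can be pasted straight after a variable name such as sex = in the conversion program.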