  • Python: Normalizing the Job Titles of LinkedIn Contacts

    CODE:

    #!/usr/bin/python 
    # -*- coding: utf-8 -*-
    
    '''
    Created on 2014-8-19
    @author: guaguastd
    @name: job_title_standard.py
    '''
    
    import os
    import csv
    from collections import Counter
    from operator import itemgetter
    from prettytable import PrettyTable
    
    # Path to the exported LinkedIn connections CSV file
    CSV_FILE = os.path.join(r"E:", "\\", "eclipse", "LinkedIn", "dfile", "my_connections.csv")
    
    # define a set of transforms that converts the first item
    # to the second item
    transforms = [
        ('Sr.', 'Senior'),
        ('Sr', 'Senior'),
        ('Jr.', 'Junior'),
        ('Jr', 'Junior'),
        ('CEO', 'Chief Executive Officer'),
        ('COO', 'Chief Operating Officer'),
        ('CTO', 'Chief Technology Officer'),
        ('CFO', 'Chief Financial Officer'),
        ('VP', 'Vice President'),
    ]
    
    # Load every row of the CSV as a dict keyed by column name,
    # closing the file once it has been read
    with open(CSV_FILE) as csv_file:
        contacts = list(csv.DictReader(csv_file, delimiter=',', quotechar='"'))
    
    # Read in the list of titles and split apart combined titles
    # such as "President/CEO". Only "/" is handled here; forms like
    # "President & CEO" or "President and CEO" are left unsplit.
    titles = []
    for contact in contacts:
        titles.extend([t.strip() for t in contact['Job Title'].split('/')
                      if contact['Job Title'].strip() != ''])
    
    # Replace common/known abbreviations
    for i, _ in enumerate(titles):
        for transform in transforms:
            titles[i] = titles[i].replace(*transform)
    
    # Print out a table of titles sorted by frequency
    pt = PrettyTable(field_names=['Title', 'Freq'])
    pt.align = 'l'
    c = Counter(titles)
    for title, freq in sorted(c.items(), key=itemgetter(1), reverse=True):
        if freq > 0:
            pt.add_row([title, freq])
    print(pt)
    
    # Print out a table of tokens sorted by frequency,
    # skipping tokens of two characters or fewer
    tokens = []
    for title in titles:
        tokens.extend([t.strip(',') for t in title.split()])
    pt = PrettyTable(field_names=['Token', 'Freq'])
    pt.align = 'l'
    c = Counter(tokens)
    for token, freq in sorted(c.items(), key=itemgetter(1), reverse=True):
        if freq > 0 and len(token) > 2:
            pt.add_row([token, freq])
    print(pt)

    RESULT:

    +-----------------------------------+------+
    | Title                             | Freq |
    +-----------------------------------+------+
    | Senior Software Developer         | 1    |
    | Sales Manager                     | 1    |
    | Software Manager                  | 1    |
    | Online Marketing Manager          | 1    |
    | Senior Consultant                 | 1    |
    | Chief Executive Officer & Founder | 1    |
    | Director                          | 1    |
    | S                                 | 1    |
    | Student                           | 1    |
    | Senior Software Engineer          | 1    |
    | ???                               | 1    |
    +-----------------------------------+------+
    +------------+------+
    | Token      | Freq |
    +------------+------+
    | Manager    | 3    |
    | Senior     | 3    |
    | Software   | 3    |
    | Marketing  | 1    |
    | Founder    | 1    |
    | Consultant | 1    |
    | Executive  | 1    |
    | Sales      | 1    |
    | Developer  | 1    |
    | Director   | 1    |
    | Chief      | 1    |
    | Officer    | 1    |
    | Student    | 1    |
    | Online     | 1    |
    | ???        | 1    |
    | Engineer   | 1    |
    +------------+------+
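The output above still contains the combined title "Chief Executive Officer & Founder", because the script splits only on "/". Its plain `str.replace` would also rewrite an abbreviation embedded in a longer word (for example, "SVP" would become "SVice President"). The following is a minimal sketch of one possible variant, not the original author's method. The names `ABBREVIATIONS`, `SPLIT_RE`, `ABBREV_RE`, `split_title`, `expand_abbreviations`, and `normalize` are hypothetical and introduced only for illustration. It assumes that "&" and a standalone "and" always separate two titles, which will not hold for every title.

```python
import re

# Hypothetical abbreviation table, mirroring the transforms list in the script.
ABBREVIATIONS = {
    'Sr': 'Senior',
    'Jr': 'Junior',
    'CEO': 'Chief Executive Officer',
    'COO': 'Chief Operating Officer',
    'CTO': 'Chief Technology Officer',
    'CFO': 'Chief Financial Officer',
    'VP': 'Vice President',
}

# Split on "/", "&", or the standalone word "and" (assumed title separators).
SPLIT_RE = re.compile(r'\s*/\s*|\s*&\s*|\s+and\s+', re.IGNORECASE)

# Match an abbreviation only as a whole word, plus an optional trailing dot,
# so "Sr." and "Sr" are both handled and "SVP" is left alone.
ABBREV_RE = re.compile(
    r'(?<!\w)(' + '|'.join(map(re.escape, ABBREVIATIONS)) + r')(?!\w)\.?')


def split_title(raw):
    """Split a raw job title into its non-empty component titles."""
    return [part.strip() for part in SPLIT_RE.split(raw) if part.strip()]


def expand_abbreviations(title):
    """Replace whole-word abbreviations with their long forms."""
    return ABBREV_RE.sub(lambda m: ABBREVIATIONS[m.group(1)], title)


def normalize(raw):
    """Split a raw title, then expand abbreviations in each part."""
    return [expand_abbreviations(part) for part in split_title(raw)]
```

With this sketch, `normalize("President & CEO")` yields two separate titles, while `expand_abbreviations("SVP of Sales")` leaves the string unchanged. The trade-off is that a title such as "Research and Development Manager" would also be split at "and", so the separator list needs tuning against real data.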



  • Original article: https://www.cnblogs.com/lytwajue/p/7224304.html