时间:2021-07-01 10:21:17 帮助过:21人阅读
def porsche_query(): #{‘字段名‘:‘字段值‘} query = {‘manufacturer‘:‘Porsche‘} return query
例2:范围查询 (找出在二十一世纪建成的所有城市注意运算符 $gte,$lte)
{ ‘areaCode‘: [‘916‘], ‘areaLand‘: 109271000.0, ‘country‘: ‘United States‘, ‘elevation‘: 13.716, ‘foundingDate‘: datetime.datetime(2000, 7, 1, 0, 0), ‘governmentType‘: [‘Council\u2013manager government‘], ‘homepage‘: [‘http://elkgrovecity.org/‘], ‘isPartOf‘: [‘California‘, u‘Sacramento County California‘], ‘lat‘: 38.4383, ‘leaderTitle‘: ‘Chief Of Police‘, ‘lon‘: -121.382, ‘motto‘: ‘Proud Heritage Bright Future‘, ‘name‘: ‘City of Elk Grove‘, ‘population‘: 155937, ‘postalCode‘: ‘95624 95757 95758 95759‘, ‘timeZone‘: [‘Pacific Time Zone‘], ‘utcOffset‘: [‘-7‘, ‘-8‘] }
def range_query():
#使用$gt,$lt来限定查询的条件的范围 query = {‘foundingDate‘:{‘$gte‘:datetime(2001,1,1),‘$lt‘:datetime(2100,12,31)}} return query
例3:找出在德国、英国或日本组装的所有汽车
{ "layout" : "rear mid-engine rear-wheel-drive layout", "name" : "Porsche Boxster", "productionYears" : [ ], "modelYears" : [ ], "bodyStyle" : "roadster", "assembly" : [ "Finland", "Germany", "Stuttgart", "Uusikaupunki" ], "class" : "sports car", "manufacturer" : "Porsche" }
def in_query():
#使用$in来找出满足调节的集合 query = {‘assembly‘:{‘$in‘:[‘Germany‘,‘England‘,‘Japan‘]}} return query
例4:点表示法 找出宽度大于 2.5 的所有汽车
{ "_id" : ObjectId("52fd438b5a98d65507d288cf"), "engine" : "Crawler-transporter__1", "dimensions" : { "width" : 34.7472, "length" : 39.9288, "weight" : 2721000 }, "transmission" : "16 traction motors powered by four generators", "modelYears" : [ ], "productionYears" : [ ], "manufacturer" : "Marion Power Shovel Company", "name" : "Crawler-transporter" }
def dot_query():
#使用.来表示父节点中的子节点 query = {‘dimensions.width‘:{‘$gt‘:2.5}} return query
聚合框架查询
例5:找出创建推特时最常用的应用
示例文件
{ "_id" : ObjectId("5304e2e3cc9e684aa98bef97"), "text" : "First week of school is over :P", "in_reply_to_status_id" : null, "retweet_count" : null, "contributors" : null, "created_at" : "Thu Sep 02 18:11:25 +0000 2010", "geo" : null, "source" : "web", "coordinates" : null, "in_reply_to_screen_name" : null, "truncated" : false, "entities" : { "user_mentions" : [ ], "urls" : [ ], "hashtags" : [ ] }, "retweeted" : false, "place" : null, "user" : { "friends_count" : 145, "profile_sidebar_fill_color" : "E5507E", "location" : "Ireland :)", "verified" : false, "follow_request_sent" : null, "favourites_count" : 1, "profile_sidebar_border_color" : "CC3366", "profile_image_url" : "http://a1.twimg.com/profile_images/1107778717/phpkHoxzmAM_normal.jpg", "geo_enabled" : false, "created_at" : "Sun May 03 19:51:04 +0000 2009", "description" : "", "time_zone" : null, "url" : null, "screen_name" : "Catherinemull", "notifications" : null, "profile_background_color" : "FF6699", "listed_count" : 77, "lang" : "en", "profile_background_image_url" : "http://a3.twimg.com/profile_background_images/138228501/149174881-8cd806890274b828ed56598091c84e71_4c6fd4d8-full.jpg", "statuses_count" : 2475, "following" : null, "profile_text_color" : "362720", "protected" : false, "show_all_inline_media" : false, "profile_background_tile" : true, "name" : "Catherine Mullane", "contributors_enabled" : false, "profile_link_color" : "B40B43", "followers_count" : 169, "id" : 37486277, "profile_use_background_image" : true, "utc_offset" : null }, "favorited" : false, "in_reply_to_user_id" : null, "id" : NumberLong("22819398300") }
def make_pipeline(): pipeline = [
# 1.根据source进行分组,然后统计出每个分组的数量,放在count中
# 2.根据count字段降序排列 {‘$group‘:{‘_id‘:‘$source‘, ‘count‘:{‘$sum‘:1}}}, {‘$sort‘:{‘count‘:-1}} ] return pipeline
例6:找出巴西利亚时区的用户,哪些用户发推次数不低于 100 次,哪些用户的关注者数量最多
def make_pipeline(): #1.使用$match将数据进行筛选
#2.使用$project(投影运算),获取结果的返回值
#3.使用$sort根据followers的值降序排列
#4.使用$limit来限制展示的条数,第一条就是满足条件的结果 pipeline = [ {‘$match‘:{‘user.time_zone‘:‘Brasilia‘, ‘user.statuses_count‘:{‘$gte‘:100}}}, {‘$project‘:{‘followers‘:‘$user.followers_count‘, ‘screen_name‘:‘$user.screen_name‘, ‘tweets‘:‘$user.statuses_count‘}}, {‘$sort‘:{‘followers‘:-1}}, {‘$limit‘:1} ] return pipeline
例7:找出印度的哪个地区包括的城市最多
示例文件
{ "_id" : ObjectId("52fe1d364b5ab856eea75ebc"), "elevation" : 1855, "name" : "Kud", "country" : "India", "lon" : 75.28, "lat" : 33.08, "isPartOf" : [ "Jammu and Kashmir", "Udhampur district" ], "timeZone" : [ "Indian Standard Time" ], "population" : 1140 }
def make_pipeline(): #1.根据$match筛选出国家
#2.根据$unwind将列表形式的字段进行拆分
#3.根据$group将拆分的项进行分组,并统计出总数count
#4.根据$sort将总数count进行降序排列,找出结果集 pipeline = [ {‘$match‘:{‘country‘:‘India‘}}, {‘$unwind‘:‘$isPartOf‘}, {‘$group‘:{‘_id‘:‘$isPartOf‘, ‘count‘:{‘$sum‘:1}}}, {‘$sort‘:{‘count‘:-1}} ] return pipeline
例8:找出每个用户的所有推特文本数量,仅数出推特数量在前五名的用户。
def make_pipeline(): #1.使用$group根据screen_name进行分组
#2.使用$push将所有的text的值放入到tweet_texts中
#3.使用$sum统计出总数
#4.使用$sort将总数count进行降序排列
#5.使用$limit获取前5的用户 pipeline = [ {‘$group‘:{‘_id‘:‘$user.screen_name‘, ‘tweet_texts‘:{‘$push‘:‘$text‘}, ‘count‘:{‘$sum‘:1}}}, {‘$sort‘:{‘count‘:-1}}, {‘$limit‘:5} ] return pipeline
例9:找出印度各个地区的平均人口数量是多少
def make_pipeline(): #1.使用$match筛选出国家India
#2.使用$unwind对isPartOf进行拆分
#3.使用$group将isPartOf进行分组,在使用$avg计算出平均人口
#4.使用$group将avg的值进行展示 pipeline = [ {‘$match‘:{‘country‘:‘India‘}}, {‘$unwind‘:‘$isPartOf‘}, {‘$group‘:{‘_id‘:‘$isPartOf‘, ‘avgp‘:{‘$avg‘:‘$population‘}}}, {‘$group‘:{‘_id‘:‘India Regional City Population avg‘, ‘avg‘:{‘$avg‘:‘$avgp‘}}} ] return pipeline
练习
习题集03
1.仅处理 FIELDS 字典中作为键的字段,并返回清理后的值字典列表
需求:
1.根据 FIELDS 字典中的映射更改字典的键
2.删掉“rdf-schema#label”中的小括号里的多余说明,例如“(spider)”
3.如果“name”为“NULL”,或包含非字母数字字符,将其设为和“label”相同的值
4.如果字段的值为“NULL”,将其转换为“None”
5.如果“synonym”中存在值,应将其转换为数组(列表),方法是删掉“{}”字符,并根据“|” 拆分字符串。剩下的清理方式将由你自行决定,例如删除前缀“*”等。如果存在单数同义词,值应该依然是列表格式。
6.删掉所有字段前后的空格(如果有的话)
7.输出结构应该如下所示
[ { ‘label‘: ‘Argiope‘, ‘uri‘: ‘http://dbpedia.org/resource/Argiope_(spider)‘, ‘description‘: ‘The genus Argiope includes rather large and spectacular spiders that often ...‘, ‘name‘: ‘Argiope‘, ‘synonym‘: ["One", "Two"], ‘classification‘: { ‘family‘: ‘Orb-weaver spider‘, ‘class‘: ‘Arachnid‘, ‘phylum‘: ‘Arthropod‘, ‘order‘: ‘Spider‘, ‘kingdom‘: ‘Animal‘, ‘genus‘: None } }, { ‘label‘: ... , }, ... ]
import codecs import csv import json import pprint import re DATAFILE = ‘arachnid.csv‘ FIELDS ={‘rdf-schema#label‘: ‘label‘, ‘URI‘: ‘uri‘, ‘rdf-schema#comment‘: ‘description‘, ‘synonym‘: ‘synonym‘, ‘name‘: ‘name‘, ‘family_label‘: ‘family‘, ‘class_label‘: ‘class‘, ‘phylum_label‘: ‘phylum‘, ‘order_label‘: ‘order‘, ‘kingdom_label‘: ‘kingdom‘, ‘genus_label‘: ‘genus‘} def process_file(filename, fields): #获取FIELDS字典的keys列表 process_fields = fields.keys() #存放结果集 data = [] with open(filename, "r") as f: reader = csv.DictReader(f) #跳过文件中的前3行 for i in range(3): l = reader.next() #读文件 for line in reader: # YOUR CODE HERE #存放总的字典 res = {} #存放key是classification的子字典 res[‘classification‘] = {} #循环FIELDS字典的keys for field in process_fields: #获取excel中key所对应的val,条件1 tmp_val = line[field].strip() #生成json数据的新key,即是FIELDS字典的value new_key = FIELDS[field] #条件4 if tmp_val == ‘NULL‘: tmp_val = None #条件2 if field == ‘rdf-schema#label‘: tmp_val = re.sub(r‘\(.*\)‘,‘‘,tmp_val).strip() #条件3 if field == ‘name‘ and line[field] == ‘NULL‘: tmp_val = line[‘rdf-schema#label‘].strip() #条件5 if field == ‘synonym‘ and tmp_val: tmp_val = parse_array(line[field]) #子字典中所包含的的key if new_key in [‘kingdom‘,‘family‘,‘order‘,‘phylum‘,‘genus‘,‘class‘]: #子字典中所包含的的key的value res[‘classification‘][new_key] = tmp_val continue #将新的key和val放入到res中,然后加入到列表中返回 res[new_key] = tmp_val data.append(res) return data def parse_array(v): #解析数组 #如果以{开头和}结尾,删除左右的{},并以|进行分割,最后去除每一个项的空格,返回 if (v[0] == "{") and (v[-1] == "}"): v = v.lstrip("{") v = v.rstrip("}") v_array = v.split("|") v_array = [i.strip() for i in v_array] return v_array return [v]
def test(): #测试函数,如果不出错,结果正确 data = process_file(DATAFILE, FIELDS) print "Your first entry:" pprint.pprint(data[0]) first_entry = { "synonym": None, "name": "Argiope", "classification": { "kingdom": "Animal", "family": "Orb-weaver spider", "order": "Spider", "phylum": "Arthropod", "genus": None, "class": "Arachnid" }, "uri": "http://dbpedia.org/resource/Argiope_(spider)", "label": "Argiope", "description": "The genus Argiope includes rather large and spectacular spiders that often have a strikingly coloured abdomen. These spiders are distributed throughout the world. Most countries in tropical or temperate climates host one or more species that are similar in appearance. The etymology of the name is from a Greek name meaning silver-faced." } assert len(data) == 76 assert data[0] == first_entry assert data[17]["name"] == "Ogdenia" assert data[48]["label"] == "Hydrachnidiae" assert data[14]["synonym"] == ["Cyrene Peckham & Peckham"] if __name__ == "__main__": test()
2.向MonogoDB中插入数据
import json def insert_data(data, db): #直接调用insert方法插入即可 arachnids = db.arachnid.insert(data) if __name__ == "__main__": from pymongo import MongoClient client = MongoClient("mongodb://localhost:27017") db = client.examples with open(‘arachnid.json‘) as f: data = json.loads(f.read()) insert_data(data, db) print db.arachnid.find_one()
习题集04
实例文本
{ "_id" : ObjectId("52fe1d364b5ab856eea75ebc"), "elevation" : 1855, "name" : "Kud", "country" : "India",