Just how big is 《长津湖》 (The Battle at Lake Changjin)? A Python comparison of three films gives the answer

Published 2021-10-11 00:16

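Hi everyone, I'm J哥.

The National Day holiday ended a few days ago, and we have all gone back to our posts to resume the daily grind (or the daily slacking off).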

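Since 2019, every National Day has seen the release of a film titled on the pattern 我和我的×× ("Me and My ...") to celebrate the holiday. Judging by the box-office trend of the previous two years, these films were far more popular than others released in the same period and comfortably held first place at the box office. This year is no exception: 《我和我的父辈》 (My Country, My Parents) was released, telling stories of parents and children in four segments, and its content has been well received by audiences.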

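Surprisingly, though, its box office is far lower than that of another National Day release, 《长津湖》, whose buzz and number of positive reviews are also far higher. To look into the details, this article does a review analysis of three films released during this year's National Day period: 《我和我的父辈》, 《长津湖》, and 《五个扑水的少年》 (Water Boys). Many readers may be hearing of 《五个扑水的少年》 for the first time. Its buzz is far below that of the other two, but it did open during this year's National Day period, and according to the Maoyan (猫眼) ranking it is not doing badly, sitting in third place.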

Tech stack

Before we start, a word on the tech stack used in this article, which falls into two parts:

Languages: Python, JavaScript;

Libraries: ECharts, stylecloud;

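Review comparison

First, let's look at the reviews. Here I used Python to fetch a portion of the Douban (豆瓣) comments for each of the three films. I won't go into detail on scraping Douban comments; if you're unfamiliar with it, see my earlier post. The core code is below: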

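import time

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

# Assumption: comments are stored in a local MongoDB. The connection string and
# the database/collection names are placeholders, not taken from the original post.
collection = MongoClient("mongodb://localhost:27017")["douban"]["comments"]

headers = {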
    "Cookie":"bid=tulFhUK9Lzo; douban-fav-remind=1; ll=\"118160\"; _vwo_uuid_v2=D55143433EAF6AF4EB29A904F8BE781A1|4d5d27125abfe3f6d29caa68ba504fed; ap_v=0,6.0; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1632849782%2C%22https%3A%2F%2Fwww.google.com%2F%22%5D; _pk_ses.100001.4cf6=*; __utma=30149280.52492667.1628212627.1629608096.1632849782.3; __utmc=30149280; __utmz=30149280.1632849782.3.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utma=223695111.788106722.1629608096.1629608096.1632849782.2; __utmb=223695111.0.10.1632849782; __utmc=223695111; __utmz=223695111.1632849782.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmb=30149280.3.10.1632849782; _pk_id.100001.4cf6=254979423a09aae4.1629608097.2.1632851386.1629608485.",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-TW;q=0.6",
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
}
# Part 1: data scraping. Change the movie id to scrape a different film.
movieId = "35030151"
for offset in range(0, 220, 20):
    url = "https://movie.douban.com/subject/{}/comments?start={}&limit=20&status=P&sort=new_score".format(movieId, offset)
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    time.sleep(2)  # pause between pages so the site is not hit too quickly
    for comment_item in soup.select("#comments > .comment-item"):
        try:
            avatar = comment_item.select(".avatar a img")[0].get("src")
            name = comment_item.select(".comment h3 .comment-info a")[0]
            rate = comment_item.select(".comment h3 .comment-info span:nth-child(3)")[0]
            date = comment_item.select(".comment h3 .comment-info span:nth-child(4)")[0]
            comment = comment_item.select(".comment .comment-content span")[0]
            data_json = {
                'avatar': avatar,
                'name': str(name.string).strip(),
                # the first class looks like "allstar50"; drop the prefix to keep "50"
                'rate': str(rate.get("class")[0]).replace("allstar", "").strip(),
                'date': str(date.string).strip(),
                'comment': str(comment.string).strip()
            }
            # the avatar URL is used as the de-duplication key
            if not collection.find_one({'avatar': avatar}):
                print("data_json is {}".format(data_json))
                collection.insert_one(data_json)
        except Exception as e:
            # items whose markup does not match the selectors are reported and skipped
            print(e)
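The scraper stores the rating as whatever remains of the CSS class after removing "allstar". A slightly stricter way to turn that class into a star count is sketched below. It assumes rating classes follow the pattern "allstar10" through "allstar50" (stars times ten). The helper name `parse_stars` is hypothetical and not part of the original code.

```python
import re
from typing import Optional

# Assumption: rating classes look like "allstar10" ... "allstar50",
# i.e. the number is the star count multiplied by ten.
_RATE_CLASS = re.compile(r"^allstar(\d+)$")


def parse_stars(css_class: str) -> Optional[int]:
    """Return the star count for a class such as 'allstar40', else None."""
    match = _RATE_CLASS.match(css_class.strip())
    if match is None:
        # The element was not a rating span (for example, the comment has no rating).
        return None
    return int(match.group(1)) // 10


if __name__ == "__main__":
    print(parse_stars("allstar50"))
    print(parse_stars("comment-time"))
```

Returning None for anything that is not a rating class keeps unrated comments out of the star statistics instead of storing an arbitrary class name as the rating.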

First, let's see whether the number of comments on the three films differs over time, which gives us Figure 1 below.

Figure 1
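The per-day counts behind a chart like Figure 1 could be computed roughly as follows. This is a minimal sketch, assuming each stored `date` value starts with a `YYYY-MM-DD` day (for example "2021-09-30 20:15:31"). The function names and the sample data are hypothetical, and the series dictionary only mirrors the general shape of an ECharts line series.

```python
from collections import Counter
from typing import Dict, Iterable, List


def comments_per_day(dates: Iterable[str]) -> Dict[str, int]:
    """Count comments per calendar day, ordered chronologically."""
    # Keep only the leading YYYY-MM-DD part of each timestamp.
    counts = Counter(d.strip()[:10] for d in dates if d and d.strip())
    # ISO-style dates sort chronologically when sorted as strings.
    return dict(sorted(counts.items()))


def to_line_series(name: str, per_day: Dict[str, int], days: List[str]) -> dict:
    """Build one line-series dict, filling days without comments with 0."""
    return {"name": name, "type": "line", "data": [per_day.get(day, 0) for day in days]}


if __name__ == "__main__":
    # Made-up sample timestamps, for illustration only.
    sample = ["2021-09-30 10:00:00", " 2021-09-30 11:30:00\n", "2021-09-24 09:00:00"]
    per_day = comments_per_day(sample)
    print(per_day)
    print(to_line_series("sample film", per_day, ["2021-09-24", "2021-09-25", "2021-09-30"]))
```

Passing the same `days` list for every film keeps the three lines aligned on a shared x-axis even when a film has no comments on a given day.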

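Judging from Figure 1, the comment trends for the three films are consistent: the number of comments starts to climb from the 24th, peaks on the 30th, and then gradually falls back.

This trend is reasonable. Comments posted on or before the 30th can be regarded as feedback from viewers who saw preview screenings, which are also a way for distributors to build buzz for a film and maximize returns.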

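The bigger puzzle is the comparison of comment counts. According to the line chart, 《五个扑水的少年》 has far more comments than 《长津湖》 and 《我和我的父辈》. Good or bad reviews aside, from a communication-studies perspective the former should be generating far more buzz than the latter two, yet the box-office comparison later on shows exactly the opposite. As for why, I'll leave that for you to ponder. All I can say is that the makers of 《长津湖》 are truly confident: comment volume is low, but the box office is outstanding.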

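I also made a simple comparison of the star-rating distributions that come with the reviews. For 《少年》, 《长津湖》, and 《父辈》 (I'm being lazy and using short names here), the charts are shown in Figures 2, 3, and 4.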
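The star-rating breakdown for one film could be tallied roughly as follows. This is a sketch, assuming the stored `rate` values are the strings "10" through "50" produced by the scraper above. The function name `star_distribution` is hypothetical, and the name/value list only mirrors the general shape of ECharts pie-chart data.

```python
from collections import Counter
from typing import Iterable, List

# Assumption: valid ratings are stored as "10", "20", "30", "40", "50".
_LEVELS = ("50", "40", "30", "20", "10")


def star_distribution(rates: Iterable[str]) -> List[dict]:
    """Return one {name, value} entry per star level, from 5 stars down to 1."""
    counts = Counter(r for r in rates if r in _LEVELS)  # ignore unrated comments
    return [
        {"name": "{} star".format(int(level) // 10), "value": counts.get(level, 0)}
        for level in _LEVELS
    ]


if __name__ == "__main__":
    # Made-up sample ratings, for illustration only.
    print(star_distribution(["50", "50", "40", "10", "comment-time"]))
```

Listing every level, including those with zero comments, keeps the three films' charts comparable slice by slice.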