Amazon question/answer data

Julian McAuley, UCSD

Description

This dataset contains Question and Answer data from Amazon, totaling around 1.4 million answered questions.

This dataset can be combined with Amazon product review data, available here, by matching ASINs in the Q/A dataset with ASINs in the review data. The review data also includes product metadata (product titles etc.).
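As a rough illustration of the ASIN matching described above, the sketch below joins a few in-memory records in plain Python. The review field names ("reviewText", "overall"), the second ASIN, and the helper name join_on_asin are assumptions made for this example, not part of the dataset documentation.

```python
from collections import defaultdict

# Tiny in-memory stand-ins for the two datasets. The review fields
# ("reviewText", "overall") and the second ASIN are hypothetical.
qa_records = [
    {"asin": "B000050B6Z", "question": "Can you use this unit with GEL shaving cans?"},
    {"asin": "B00000XXXX", "question": "Hypothetical question with no matching review"},
]
review_records = [
    {"asin": "B000050B6Z", "reviewText": "Hypothetical review text", "overall": 5.0},
]

def join_on_asin(qa, reviews):
    """Attach to each Q/A record the list of reviews sharing its ASIN."""
    reviews_by_asin = defaultdict(list)
    for r in reviews:
        reviews_by_asin[r["asin"]].append(r)
    # .get() avoids creating empty entries for ASINs without reviews
    return [dict(q, reviews=reviews_by_asin.get(q["asin"], [])) for q in qa]

joined = join_on_asin(qa_records, review_records)
for row in joined:
    print(row["asin"], len(row["reviews"]))
```

Indexing the reviews by ASIN first keeps the join linear in the number of records, which matters for the larger categories.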

Files

Sample question (and answer):

{ "asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y", "answerTime": "Aug 8, 2014", "unixTime": 1407481200, "question": "Can you use this unit with GEL shaving cans?", "answer": "Yes. If the can fits in the machine it will despense hot gel lather. I've been using my machine for both , gel and traditional lather for over 10 years." }

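where the fields are:

asin: product ID (ASIN), e.g. B000050B6Z
questionType: type of question, e.g. 'yes/no'
answerType: type of answer, e.g. 'Y'
answerTime: raw answer timestamp
unixTime: answer timestamp converted to Unix time
question: question text
answer: answer text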
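To make the record layout concrete, here is a minimal sketch that parses one record and converts its unixTime field to a calendar date. The record is the sample shown above with the answer text shortened; interpreting the timestamp in UTC is an assumption.

```python
import ast
from datetime import datetime, timezone

# One record, copied from the sample above (answer text shortened here).
line = (
    '{"asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y", '
    '"answerTime": "Aug 8, 2014", "unixTime": 1407481200, '
    '"question": "Can you use this unit with GEL shaving cans?", '
    '"answer": "Yes."}'
)

# literal_eval accepts Python literals only, so no code in the line is run
record = ast.literal_eval(line)

# Assumption: interpret the timestamp in UTC
answered_at = datetime.fromtimestamp(record["unixTime"], tz=timezone.utc)

print(record["asin"], record["questionType"], record["answerType"])
print(answered_at.date().isoformat())  # 2014-08-08, matching answerTime
```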

Per-category files

Below are files for individual product categories, which have already had duplicate item reviews removed.

Appliances (9,011 questions)
Arts Crafts and Sewing (21,262 questions)
Automotive (89,923 questions)
Baby (28,933 questions)
Beauty (42,422 questions)
Cell Phones and Accessories (85,865 questions)
Clothing Shoes and Jewelry (22,068 questions)
Electronics (314,263 questions)
Grocery and Gourmet Food (19,538 questions)
Health and Personal Care (80,496 questions)
Home and Kitchen (184,439 questions)
Industrial and Scientific (12,136 questions)
Musical Instruments (23,322 questions)
Office Products (43,608 questions)
Patio Lawn and Garden (59,595 questions)
Pet Supplies (36,607 questions)
Software (10,636 questions)
Sports and Outdoors (146,891 questions)
Tools and Home Improvement (101,088 questions)
Toys and Games (51,486 questions)
Video Games (13,307 questions)
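As a quick sanity check, the per-category counts listed above can be summed and compared with the "around 1.4 million" figure in the description. The sketch below only restates the numbers from the list.

```python
# Question counts copied from the per-category list above
question_counts = {
    "Appliances": 9011,
    "Arts Crafts and Sewing": 21262,
    "Automotive": 89923,
    "Baby": 28933,
    "Beauty": 42422,
    "Cell Phones and Accessories": 85865,
    "Clothing Shoes and Jewelry": 22068,
    "Electronics": 314263,
    "Grocery and Gourmet Food": 19538,
    "Health and Personal Care": 80496,
    "Home and Kitchen": 184439,
    "Industrial and Scientific": 12136,
    "Musical Instruments": 23322,
    "Office Products": 43608,
    "Patio Lawn and Garden": 59595,
    "Pet Supplies": 36607,
    "Software": 10636,
    "Sports and Outdoors": 146891,
    "Tools and Home Improvement": 101088,
    "Toys and Games": 51486,
    "Video Games": 13307,
}

total_questions = sum(question_counts.values())
largest_category = max(question_counts, key=question_counts.get)
print(f"{total_questions:,} questions; largest category: {largest_category}")
```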

Questions with multiple answers

Below are updated Q/A files as used in our ICDM paper. Importantly, these files include multiple answers to each question, allowing the ambiguity of answers to be studied.

Automotive (59,415 questions, 233,784 answers)
Baby (21,996 questions, 82,034 answers)
Beauty (32,936 questions, 125,652 answers)
Cell Phones and Accessories (60,761 questions, 237,220 answers)
Clothing Shoes and Jewelry (17,233 questions, 66,709 answers)
Electronics (231,449 questions, 867,921 answers)
Grocery and Gourmet Food (15,373 questions, 62,243 answers)
Health and Personal Care (63,962 questions, 255,209 answers)
Home and Kitchen (148,728 questions, 611,335 answers)
Musical Instruments (17,971 questions, 67,326 answers)
Office Products (33,984 questions, 130,088 answers)
Patio Lawn and Garden (47,574 questions, 193,780 answers)
Pet Supplies (30,848 questions, 133,274 answers)
Sports and Outdoors (114,496 questions, 444,900 answers)
Tools and Home Improvement (81,609 questions, 327,597 answers)
Toys and Games (39,549 questions, 151,779 answers)
Video Games (7,744 questions, 28,893 answers)
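To give a feel for how many answers each question has in these files, the sketch below computes the average number of answers per question for three of the categories listed above. It uses only the counts from the list, not the files themselves.

```python
# (questions, answers) copied from the list above for three categories
multi_answer_counts = {
    "Baby": (21996, 82034),
    "Electronics": (231449, 867921),
    "Video Games": (7744, 28893),
}

answers_per_question = {
    category: answers / questions
    for category, (questions, answers) in multi_answer_counts.items()
}

for category, ratio in answers_per_question.items():
    print(f"{category}: {ratio:.2f} answers per question")
```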

Citation

Please cite the following if you use the data in any way:

Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems
Mengting Wan, Julian McAuley
International Conference on Data Mining (ICDM), 2016

Addressing complex and subjective product-related queries with customer reviews
Julian McAuley, Alex Yang
World Wide Web (WWW), 2016

Code

Reading the data

Each line of a file can be treated as a Python dictionary object. A simple script to read any of the above data is as follows:

import gzip

def parse(path):
    # Each line of the gzipped file is one Python dict literal
    g = gzip.open(path, 'r')
    for l in g:
        yield eval(l)
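Because eval executes whatever expression a line contains, a more cautious variant may be preferable for files from an untrusted source. The sketch below swaps in ast.literal_eval, which accepts Python literals only. The function name parse_safe, the UTF-8 encoding, and the tiny demo file are assumptions for illustration, not part of the original scripts.

```python
import ast
import gzip
import os
import tempfile

def parse_safe(path):
    """Yield one dict per line, accepting Python literals only."""
    # Assumption: the files are UTF-8 encoded text
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield ast.literal_eval(line)

# Demonstration on a tiny synthetic file (not one of the real data files)
demo_path = os.path.join(tempfile.mkdtemp(), 'qa_demo.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write("{'asin': 'B000050B6Z', 'questionType': 'yes/no', 'unixTime': 1407481200}\n")

records = list(parse_safe(demo_path))
print(records[0]['asin'])
```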

Convert to 'strict' json

The above data can be read with Python's eval, but is not strict JSON. If you'd like to use a language other than Python, you can convert the data to strict JSON as follows:

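import json
import gzip

def parse(path):
    # Re-serialize each Python dict literal as a strict JSON string
    with gzip.open(path, 'r') as g:
        for l in g:
            yield json.dumps(eval(l))

# Write one JSON object per line
with open("output.strict", 'w') as f:
    for l in parse("qa_Video_Games.json.gz"):
        f.write(l + '\n')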
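Once converted, the output holds one JSON object per line and can be read back with any JSON parser. A minimal Python sketch follows; the helper name read_strict and the synthetic demo file are assumptions for illustration.

```python
import json
import os
import tempfile

def read_strict(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demonstration on a tiny synthetic file standing in for output.strict
strict_path = os.path.join(tempfile.mkdtemp(), 'output.strict')
with open(strict_path, 'w', encoding='utf-8') as f:
    f.write(json.dumps({"asin": "B000050B6Z", "answerType": "Y"}) + '\n')

rows = list(read_strict(strict_path))
print(rows[0]["asin"], rows[0]["answerType"])
```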

Pandas data frame

This code reads the data into a pandas data frame:

import pandas as pd
import gzip

def parse(path):
    with gzip.open(path, 'rb') as g:
        for l in g:
            yield eval(l)

def getDF(path):
    # Collect records keyed by row index, then build the frame row-wise
    df = {}
    for i, d in enumerate(parse(path)):
        df[i] = d
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('qa_Video_Games.json.gz')
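With the data in a frame, a few common first steps might look like the sketch below. It builds a small synthetic frame in the same row-wise way rather than reading a real file. The second ASIN and the "open-ended" question type are placeholder values assumed for the example.

```python
import pandas as pd

# Synthetic rows shaped like the Q/A records; "B00000XXXX" and
# "open-ended" are placeholder values, not taken from the data.
records = {
    0: {"asin": "B000050B6Z", "questionType": "yes/no", "answerType": "Y", "unixTime": 1407481200},
    1: {"asin": "B000050B6Z", "questionType": "yes/no", "answerType": "N", "unixTime": 1407481200},
    2: {"asin": "B00000XXXX", "questionType": "open-ended", "unixTime": 1407481200},
}
df = pd.DataFrame.from_dict(records, orient='index')

# Distribution of answer types among yes/no questions
yes_no = df[df["questionType"] == "yes/no"]
answer_counts = yes_no["answerType"].value_counts()

# Convert Unix seconds to timestamps
df["answerDate"] = pd.to_datetime(df["unixTime"], unit="s")

# Number of questions per product
questions_per_asin = df.groupby("asin").size()

print(answer_counts)
print(questions_per_asin)
```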