Data analysis project : Eco-friendly products trends in Amazon using Spark
Today more than ever, ecology is one of the key issues facing our society. This concern has emerged gradually and has become more and more important, especially in our everyday products. By focusing on Amazon products in this study, we will first try to observe this trend, then understand it by analyzing the characteristics of these products and their consumers, and finally try to predict its evolution and draw insights about our society from it.
This project studies how people's interest in climate change, renewable energy, pollution and organic food has evolved, using Amazon's dataset. The dataset offers many product categories to work on: [Books, Kindle Store, TV-Movies, Home and Kitchen, Health – Personal Care, Tools and Home Improvement, Grocery and Gourmet Food]. Some of them need in-depth analysis to yield relevant insights. We then analyze and visualize the evolution of the products concerned, using the information provided by the large volume of reviews and product descriptions.
The results, analyses and graphs come from the main notebook, project_general.ipynb, which gives more detail on our analysis and insights (code, graphs) for each section. It can be found in this GitHub repository: https://github.com/pelletierkevin/AmazonProject_DataAnalysis
I. Data collection – Dataset
The dataset consists of Amazon product reviews and metadata: 142.8 million reviews spanning May 1996 to July 2014. It covers many categories, and the project focuses on a few of them: [Books, Home and Kitchen, Health – Personal Care, Tools and Home Improvement, Grocery and Gourmet Food]. They were selected for their relevance and potential link to eco-products. Together these categories contain over 3,200,000 products, each with at least 5 reviews. Product metadata is stored separately from the reviews, per category. The data totals 30 GB.
The link to the dataset can be found here : http://jmcauley.ucsd.edu/data/amazon/links.html
II. Statistics – Algorithms
A. Tools and Libraries
The first step of our work is cleaning the raw data. For that we use Apache Spark from Python (pyspark). Once the data was filtered and thus more compact, we moved to the pandas library for deeper analysis. In addition, we used seaborn for data visualization, scikit-learn for machine learning and finally nltk for natural language processing.
B. Cleaning phase
The cleaning phase consists of filtering out unnecessary information (such as product image URLs) and corrupted data from the metadata. Once done, we save the result in parquet format so the data can be reused easily. The second phase extracts eco-friendly items from the metadata. A product is classified as eco-friendly if its Title, Description or Brand matches a word in a list of eco-friendly words. This list was built using sentiment analysis plus an overview of the data. The resulting set of products was then joined to their respective reviews and saved as a parquet file. Finally, the dataset containing the reviews was also converted to a parquet file. At the end of the cleaning phase, we have three datasets for each category: metadata, reviews, and metadata with reviews for eco-friendly products. This is summed up in Part 1 of the notebook.
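The keyword-matching rule can be sketched in plain Python as follows. This is only an illustration, not the notebook's Spark code: the keyword list, the `asin` key and the sample products are assumptions made up for the example, and the project's actual word list is not reproduced here.

```python
import re

# Hypothetical keyword list (assumption); the project's real list differs.
ECO_KEYWORDS = ["eco-friendly", "organic", "biodegradable", "recycled", "solar"]

# One case-insensitive pattern with word boundaries, so that "solar"
# does not match inside a longer word such as "solarium".
ECO_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in ECO_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_eco_friendly(product):
    """Return True if the title, description or brand contains an eco keyword."""
    for field in ("title", "description", "brand"):
        text = product.get(field) or ""  # missing or None fields count as empty
        if ECO_PATTERN.search(text):
            return True
    return False


# Made-up sample records, only for illustration.
products = [
    {"asin": "A1", "title": "Organic Green Tea", "brand": "TeaCo"},
    {"asin": "A2", "title": "Steel Hammer", "description": "Heavy duty", "brand": None},
    {"asin": "A3", "title": "Garden Lamp", "description": "Solar powered lamp"},
]
eco_products = [p for p in products if is_eco_friendly(p)]
```

A plain substring match would be simpler but would flag more false positives; the word boundaries are a design choice of this sketch, not something stated in the report.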
C. Distribution of products over the years
The first part of our project was to analyze the number of eco-friendly products released every year. Since the release date of each product was not available, we assumed that the year of a product's first review was its publication year. We worked on both eco-friendly and general products. To do so, we grouped the reviews by product ID and kept only the minimal date of each group. It was interesting to compare the number of products between the eco-friendly part and the general one, but we also computed the proportions to observe the evolution within each part. We computed these results using the histogram methods from Pandas for the counts and from Numpy, with the 'density' parameter, for the proportions. However, the data collection stopped in July 2014, so the results for that year cover only part of the year and could lead to misinterpretation. That is why we fitted a degree-5 polynomial regression model for each category to describe the evolution mathematically. We trained it on data up to 2013 and then predicted the most probable number of products published in 2014.
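As a rough sketch of these two steps, the snippet below takes the earliest review year per product and then extrapolates yearly counts with a degree-5 polynomial. It uses `numpy.polyfit` rather than the scikit-learn model from the notebook, and the function names and toy data are assumptions chosen for illustration.

```python
import numpy as np


def first_review_year(reviews):
    """Map each product ID to the year of its earliest review."""
    first_year = {}
    for product_id, year in reviews:
        if product_id not in first_year or year < first_year[product_id]:
            first_year[product_id] = year
    return first_year


def predict_count(years, counts, target_year, degree=5):
    """Fit a polynomial of the given degree and evaluate it at target_year."""
    years = np.asarray(years, dtype=float)
    center = years.mean()  # centering the years keeps the fit well conditioned
    coeffs = np.polyfit(years - center, np.asarray(counts, dtype=float), degree)
    return float(np.polyval(coeffs, target_year - center))


# Made-up (product ID, review year) pairs.
reviews = [("P1", 2005), ("P1", 2003), ("P2", 2010), ("P2", 2012)]
release_years = first_review_year(reviews)

# Toy yearly counts (not real data), training on 1996..2013 only.
train_years = list(range(1996, 2014))
train_counts = [(year - 1995) ** 2 for year in train_years]
estimate_2014 = predict_count(train_years, train_counts, 2014)
```

A degree-5 polynomial can swing sharply outside the training range, so an extrapolated value like this is best read as an indication rather than a forecast.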
D. Reviews Analysis : Distribution and helpfulness
We now turn to the distribution of reviews and their helpfulness. For the distribution of reviews per product, we reduced all the reviews by product ID, counted the reviews of each product and produced a histogram. For helpfulness, the ratings have the form [3,5], where the first element is the number of positive votes and the second is the total number of votes. We computed a helpfulness percentage for each review (here 3/5, i.e. 60%). We took care to filter out reviews with fewer than 3 votes, which could introduce bias: many reviews have no votes at all, or a single positive vote, which makes the ratio unreliable and could produce wrong estimates. To obtain a meaningful comparison plot, we converted the rating votes histogram into percentages. Finally, to compare the categories, we computed the mean number of reviews and the average helpfulness rating of each category.
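The helpfulness computation can be illustrated with a few lines of plain Python. The function name, the `min_votes` parameter and the sample votes are assumptions for this sketch; only the [positive, total] format and the 3-vote threshold come from the text above.

```python
def helpfulness_ratio(helpful, min_votes=3):
    """Return positive/total for a [positive, total] pair, or None if too few votes."""
    positive, total = helpful
    if total < min_votes:
        return None  # too few votes to give a reliable ratio
    return positive / total


# Made-up helpfulness pairs: two of them fall below the 3-vote threshold.
votes = [[3, 5], [0, 0], [1, 1], [9, 10]]
ratios = [r for r in (helpfulness_ratio(v) for v in votes) if r is not None]
average_helpfulness = sum(ratios) / len(ratios)
```

Filtering before averaging also avoids a division by zero for reviews with no votes at all.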
E. Price analysis : Evolution and comparison
In this part we focused on the average price of products per year. Comparing price trends between eco-friendly products and products in general is interesting, as organic products are usually known to be more expensive. It also lets us compare the categories with each other. For this task, we used the release year as the reference year of every product. We grouped the dataset by year, computed the mean price of each group and plotted the histogram.
Additionally, we accounted for outliers in the extracted eco-friendly products by attaching an error value to each bin. We calculated this error by taking the standard deviation of the prices within the bin and dividing it by the number of products in the bin: Std(X) / len(X). Since eco-friendly products tend to appear in later years, we added zero values for the missing years so that eco-friendly and general products share the same year index. Finally, we plotted in the same figure the bars for the average yearly price of eco-friendly products, with the associated error, and the bars for general products. To compare the categories, we computed the average price of each category over all years, excluding the added zero values. To make the comparison more quantitative, we also produced 2 dataframes showing the difference in average price between the eco-friendly and general products of each category. The first shows the difference as a percentage of the general average price of the same category, whereas the second shows the difference in dollars.
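The per-year statistics and the two kinds of price difference can be sketched as follows. The function names and sample prices are assumptions, and the sketch uses the population standard deviation (`statistics.pstdev`), since the report does not say which variant the notebook uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def yearly_price_stats(products):
    """Return {year: (mean price, error)} with error = std(prices) / len(prices)."""
    prices_by_year = defaultdict(list)
    for year, price in products:
        prices_by_year[year].append(price)
    return {
        year: (mean(prices), pstdev(prices) / len(prices))
        for year, prices in sorted(prices_by_year.items())
    }


def price_gap(eco_mean, general_mean):
    """Return the eco vs. general price difference as (percent, dollars)."""
    dollars = eco_mean - general_mean
    return 100.0 * dollars / general_mean, dollars


# Made-up (release year, price) pairs for eco-friendly products.
eco = [(2010, 10.0), (2010, 14.0), (2011, 20.0)]
stats = yearly_price_stats(eco)
percent, dollars = price_gap(eco_mean=12.0, general_mean=10.0)
```

Only years that actually contain products appear in `stats`, which matches the idea of leaving the padded zero years out of the category averages.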
F. Sales rank correlation
This part presents a correlation analysis between the number of reviews, the overall rating, the sales rank and the publishing year of each product. One could expect a product that is popular in terms of number of reviews and overall rating to hold a better sales rank than a less popular one. First, we grouped the reviews by product ID, computed the average overall rating of each group and extracted the associated sales rank, release year and number of reviews. We then measured the correlation with the Pearson coefficient.
However, the result did not show any correlation. We took the analysis a bit further by clustering the products by number of reviews and release year using the K-Means algorithm. Iterating over 80 clusters, we could check whether the overall rating is correlated with the sales rank among products of similar popularity. Finally, after averaging the Pearson coefficients of the clusters, the result still showed no correlation, as it was very close to 0. Indeed, a qualitative look at the data shows that many products with a bad overall rating hold an excellent sales rank.
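The per-cluster averaging can be sketched in plain Python as below. The cluster assignments are taken as given (the notebook obtains them with K-Means), and the function names and toy data are assumptions for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns None if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def mean_cluster_correlation(clusters):
    """Average the Pearson coefficient over clusters of (rating, sales_rank) pairs."""
    coefficients = []
    for points in clusters.values():
        if len(points) < 2:
            continue  # a single point carries no correlation information
        r = pearson([p[0] for p in points], [p[1] for p in points])
        if r is not None:
            coefficients.append(r)
    return sum(coefficients) / len(coefficients) if coefficients else None


# Toy clusters (not real data): one with a negative trend, one with a positive trend.
clusters = {
    0: [(1.0, 300), (3.0, 200), (5.0, 100)],
    1: [(1.0, 100), (3.0, 200), (5.0, 300)],
}
average_r = mean_cluster_correlation(clusters)
```

The toy data also shows a limit of this summary: an average close to 0 can hide clusters with strong but opposite correlations, so the per-cluster values are worth inspecting too.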
G. Top Eco brands
What criteria can be used to define a top brand: the number of products, the number of reviews, the grades, and so on? Looking only at overall grades, a brand with a single product rated 5 will rank higher than a brand with 70 reviews and an overall rating of 4.9. Since Amazon's product grades are integers from 1 to 5 (stars), we round the average overall ratings to keep the same scale and then sort in descending order.
We wanted to look at two main things: the average score across all of a brand's products, and its number of products. First, we grouped the reviews by item ID to count the reviews and compute the average score of each item. We then grouped by brand to compute the average of these average scores and to count the number of products per brand. Finally, we sorted the brands by best score and number of reviews. The result is shown in Part 8 of the notebook: on the left, a graph of the most reviewed brands and, on the right, the number of products available.
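The two-level grouping can be sketched in plain Python as follows. The tuple layout, the function name and the sample reviews are assumptions for illustration; the notebook performs the equivalent grouping on dataframes.

```python
from collections import defaultdict


def rank_brands(reviews):
    """Rank brands by rounded mean of per-product mean ratings, then by review count."""
    per_product = defaultdict(list)  # (brand, product_id) -> list of ratings
    for brand, product_id, rating in reviews:
        per_product[(brand, product_id)].append(rating)

    per_brand = defaultdict(lambda: {"means": [], "reviews": 0})
    for (brand, _), ratings in per_product.items():
        per_brand[brand]["means"].append(sum(ratings) / len(ratings))
        per_brand[brand]["reviews"] += len(ratings)

    rows = []
    for brand, data in per_brand.items():
        # Note: Python's round() rounds exact halves to the nearest even integer.
        score = round(sum(data["means"]) / len(data["means"]))
        rows.append((brand, score, len(data["means"]), data["reviews"]))
    # Best rounded score first, ties broken by the number of reviews.
    return sorted(rows, key=lambda row: (row[1], row[3]), reverse=True)


# Made-up (brand, product ID, rating) triples.
reviews = [
    ("A", "a1", 5), ("A", "a1", 5), ("A", "a1", 4), ("A", "a2", 5),
    ("B", "b1", 5),
    ("C", "c1", 3), ("C", "c1", 3),
]
ranking = rank_brands(reviews)
```

Each row is (brand, rounded score, number of products, number of reviews). Rounding puts brands A and B on the same score, so the review count decides their order.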
III. Findings – Results
As we cannot include all the graphics here, see the project_general.ipynb notebook to follow the process of every section and visualize the results.
A. Proportion of eco-friendly products
Unsurprisingly, the vast majority of products have no environmental aspect. The most represented category is Garden. The Books category has the smallest proportion, which is not surprising: books dealing with ecological themes are hard to find unless they are specific to this domain, and are therefore rare. One may ask whether it is wise to keep the Books category, but 0.2% of 2.5 million books still represents about 5,000 books, which is sizable. Excluding books, the pie plots show proportions between 1.5% and 3.6% for each category.
B. Evolution of eco-friendly products by year
If we look at all the histograms in log scale, we first see that eco-friendly products were not necessarily present from the start. Once they appeared on the market, even in small proportions, they followed the same trend as non-eco products. They first appeared in 1997 with Books, then in 2000 with Home & Kitchen, but we had to wait until 2004, ten years after Amazon's creation, to see eco-products in Healthcare and Grocery.
C. Distribution of the number of comments
The distribution of the number of reviews follows the same curve whether the products are eco-friendly or not. The most reviewed eco-products are in Home & Kitchen and Books, with around 3,000 reviews, whereas the Books category as a whole contains the most reviewed product so far, with more than 20,000 reviews.
If we now look at the figure showing the average number of reviews per product per category, we see that, whether the products are eco-friendly or not, the average helpfulness score is lower for books. This may be because opinions on books are more personal, political or committed than in other categories, where opinions are less varied. There is also a notable difference in rating for Grocery, perhaps because people want to leave a review or analysis of a consumable eco-product (taste, quality, additives, ...) and of its public perception.
D. Price Analysis : Evolution and comparison
In people’s minds, eco-friendly products tend to be more expensive than non-eco ones. This is not entirely false, and we reached the same result: eco-products are on average at least 20% more expensive (see Part 4 of the notebook). But there are differences between categories. In the Healthcare category, eco-friendly products are not really more expensive than average. On the contrary, eco-products in the Garden category are far more expensive.
We wanted to verify our results and see whether we could easily find an example to illustrate them. On amazon.com, we searched for eco and non-eco cups, products that can be assigned to the Home category. We found that “non-eco” cups cost $0.07/unit whereas eco-cups cost $0.15/unit, which is more than twice the price of the non-eco product.
Below is the graph for the Garden category, where the differences are clearly visible.
E. Top eco-friendly brands
For some categories such as Patio Lawn Garden, the most reviewed brand, Hydrofarm, with 580 reviews, is also the one with the most products for sale (20). In the Healthcare category, by contrast, the top brand, Diva Cup, has only 2 products for sale but more than 1,000 reviews. Brands in Home & Kitchen also have a lot of reviews, while the largest number of products for sale is only 6. If we go to amazon.com and search for the brand Zyliss, we find only 6 articles, with a total of around 1,000 reviews.
We can assume that the better a product's reviews, the more people are willing to buy it and then leave a good review after use (are they influenced by others’ points of view?). This reasoning, however, applies to products in general and not only to eco-friendly ones.
Finally, we concede that we did not compute weighted means of the overall scores. This means that an item with 100 reviews and a mean of 4.9 has the same “importance”, or weight, as a product with only 2 reviews and an overall mean of 1. We can also observe this for the brand Medline in Healthcare: they offer a lot of products but do not get a lot of reviews.
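To make this limitation concrete, here is a small sketch comparing the unweighted mean we used with a mean weighted by review count, on the example above. The function names are assumptions, and the weighted variant is shown only as a possible alternative, not as something implemented in the notebook.

```python
def unweighted_mean(items):
    """Average of per-item mean ratings, every item counting equally."""
    return sum(item_mean for item_mean, _ in items) / len(items)


def weighted_mean(items):
    """Average of per-item mean ratings, weighted by each item's review count."""
    total_reviews = sum(count for _, count in items)
    return sum(item_mean * count for item_mean, count in items) / total_reviews


# (mean rating, number of reviews) for the two items from the example above.
items = [(4.9, 100), (1.0, 2)]
plain = unweighted_mean(items)    # (4.9 + 1.0) / 2
weighted = weighted_mean(items)   # (4.9 * 100 + 1.0 * 2) / 102
```

On this example the unweighted mean is 2.95, while the weighted mean is about 4.82, much closer to what most reviewers actually reported.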
The results mentioned here are not the only ones we plotted and found.
Here are some examples of plots you can find in the notebook :
More findings are available in the project_general.ipynb notebook in the GitHub repository: https://github.com/pelletierkevin/AmazonProject_DataAnalysis
You can also find the report under report/Ada_Project_Report.pdf in the same repository.
If you want to find some other similar projects : https://codethekey.com/tag/project/