Dataset: Diamonds

Overview
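The Diamonds dataset is a built-in dataset that ships with ggplot2, one of the core tidyverse packages. It records the price, carat, cut, color, clarity, and dimensions of roughly 54,000 diamonds.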

(Instructions)
Please follow these instructions when documenting


 * Add observations under the "Observations" section; headline each observation with "Sub-heading 2".
 * Add references for every assertion you make. Use 'ref' tags to format them; see this page for instructions on tags. In each reference, provide a link to a student report.
 * For each reference, make sure to categorize under your team's name. Use the "Group" option to choose your team name.

Price Increases with Carat
Diamond price tends to increase with carat.


Team Alpha

 * The minimum price for a diamond generally increases with carat. The maximum price also increases with carat up through about 1 carat, then levels off and no longer depends on carat.
 * For diamonds with similar carat, poor cut diamonds tend to fetch lower prices. However, many ideal cut diamonds still fetch low prices.
 * Higher clarity diamonds tend to fetch higher prices for a given carat.
 * Most diamonds with an internally flawless (IF) rating are 1.5 carats or smaller. As carat increases, lower clarity grades make up a growing share of the diamonds in each carat band. Diamonds of 3 carats or more in this dataset mostly have lower clarity grades but still command high prices. While clarity is clearly associated with higher prices, especially among lower-carat diamonds, large diamonds appear to command high prices even with lower clarity ratings.
 * Cut and clarity are not strongly correlated, but most of the best clarity diamonds also have the best cut.
 * The number of diamonds at each clarity grade peaks at a different price point; low-clarity diamonds are clustered in notably large quantities in the $3000-$5000 range.
 * Diamond color looks to be more strongly correlated with carat (better color for lower carat) than with price.
 * Diamonds tend to have "round" carat values (e.g. 1, 1.5, and 2 carats). This is especially apparent around the 2-carat line. There are almost no diamonds with slightly less than 2 carats, but many with exactly 2 carats.
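The claim about minimum and maximum price can be checked by binning diamonds by carat and taking the lowest and highest price in each bin. The following is a minimal sketch rather than Team Alpha's actual method: the function name `price_range_by_carat`, the 0.5-carat bin width, and the toy rows are all assumptions made for illustration. Only the column names `carat` and `price` come from the dataset.

```python
from collections import defaultdict


def price_range_by_carat(rows, bin_width=0.5):
    """Return {bin_start: (min_price, max_price)} for carat bins of the given width."""
    bins = defaultdict(list)
    for row in rows:
        # An integer bin index avoids floating-point drift in the bin labels.
        index = int(row["carat"] // bin_width)
        bins[index].append(row["price"])
    return {
        round(index * bin_width, 2): (min(prices), max(prices))
        for index, prices in sorted(bins.items())
    }


# Made-up rows for illustration only; these are NOT values from the dataset.
toy_rows = [
    {"carat": 0.3, "price": 400},
    {"carat": 0.4, "price": 900},
    {"carat": 0.7, "price": 1800},
    {"carat": 0.9, "price": 4000},
    {"carat": 1.2, "price": 5000},
    {"carat": 1.4, "price": 9000},
]

envelope = price_range_by_carat(toy_rows)
for start, (low, high) in envelope.items():
    print(f"{start:.1f}-{start + 0.5:.1f} carat: min {low}, max {high}")
```

On the real data, a maximum that stops rising from one bin to the next would be consistent with the plateau described above.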



Team Beta

 * We investigated the effect of a diamond's area on its price. We hypothesized that diamonds with larger faces are more valuable.
 * We added a column called area (product of x and y) to the dataframe.
 * Pair plots show a linear relationship between area and carat, so whatever conclusions we draw for area likely apply to carat as well.
 * Observation: On average, higher cut classes seem to fetch higher prices at larger area values.
 * Higher-class `cut`s have higher prices at larger area values.
 * Observation: Diamonds cut close to a size (carat) breakpoint are less likely to be cut well.
 * The proportion of diamonds in the ideal or premium category drops the closer the size gets to 0.3, 0.5, 0.7, and 1.0 carats.
 * This is likely because diamond cutters get more value from keeping the stone larger, but at a poorer cut.
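The two steps above, adding an area column and comparing the share of well-cut diamonds at different carat values, can be sketched as follows. This is an illustrative sketch, not Team Beta's code: the helper names `add_area` and `top_cut_share`, the matching tolerance, and the toy rows are assumptions. The column names `x`, `y`, `carat`, and `cut`, and the cut labels, follow the dataset.

```python
def add_area(rows):
    """Return copies of the rows with an 'area' field equal to x * y (in mm^2)."""
    return [{**row, "area": row["x"] * row["y"]} for row in rows]


def top_cut_share(rows, carat, tolerance=0.005):
    """Share of rows at the given carat whose cut is Ideal or Premium.

    Returns None when no row matches, so an empty group is not reported as 0.
    """
    matches = [r for r in rows if abs(r["carat"] - carat) <= tolerance]
    if not matches:
        return None
    top = sum(1 for r in matches if r["cut"] in ("Ideal", "Premium"))
    return top / len(matches)


# Made-up rows for illustration only; these are NOT values from the dataset.
toy_rows = [
    {"carat": 0.50, "cut": "Good", "x": 5.0, "y": 5.0},
    {"carat": 0.50, "cut": "Fair", "x": 5.0, "y": 5.0},
    {"carat": 0.50, "cut": "Ideal", "x": 5.0, "y": 5.0},
    {"carat": 0.50, "cut": "Very Good", "x": 5.0, "y": 5.0},
    {"carat": 0.55, "cut": "Ideal", "x": 5.5, "y": 5.0},
    {"carat": 0.55, "cut": "Premium", "x": 5.5, "y": 5.0},
    {"carat": 0.55, "cut": "Ideal", "x": 5.5, "y": 5.0},
    {"carat": 0.55, "cut": "Good", "x": 5.5, "y": 5.0},
]

with_area = add_area(toy_rows)
share_at_breakpoint = top_cut_share(with_area, 0.50)
share_off_breakpoint = top_cut_share(with_area, 0.55)
print(f"area of first row: {with_area[0]['area']:.1f} mm^2")
print(f"Ideal/Premium share at 0.50 carat: {share_at_breakpoint:.2f}")
print(f"Ideal/Premium share at 0.55 carat: {share_off_breakpoint:.2f}")
```

A lower share at the breakpoint than just past it, as in these toy rows, is the pattern the observation describes.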



Observations
Data is sparse above 2.5 carats.

Price is correlated with carat between 0 and 2 carats.

Higher-quality cuts are less prevalent at greater carats.

For a given carat, higher-quality cuts are generally more valuable than lower quality cuts.



Team Delta

 * The number of samples drops significantly above 2.5 carats, making it hard to identify trends there.
 * Per carat, the price for diamonds with an ideal cut is higher than for other cuts.
 * Carat values cluster around whole numbers.
 * Hypothesis: diamonds are cut to attain a round-number carat.
 * Hypothesis: there is some rounding near whole carat weights (?)
 * Larger carat sizes tend to have a lower-quality cut.
 * There appears to be a ceiling for price.
 * Why is this?

Observations

 * Over the whole dataset, there is a negative relationship between cut quality and price, and a positive relationship between carat and price.
 * However, if carat is held constant, then there is a positive relationship between cut quality and price.
 * At a given carat, there is price variation that cannot be fully explained by cut.
 * Prices vary less among smaller-carat diamonds than among larger-carat diamonds.
 * Carat is not smoothly distributed: certain buckets have much higher counts, usually at or just above a fractional carat increment.
 * Ideal cuts are most common among smaller-carat diamonds, while lower-quality cuts become more common as carat increases.
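The first two observations describe a reversal: cut looks negatively related to price overall but positively related once carat is held constant. This can happen when lower-quality cuts are concentrated among the larger, pricier stones. The sketch below illustrates the mechanism on made-up rows; the helper `mean_price_by` and every value in `toy_rows` are assumptions chosen to show the effect, not figures from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_price_by(rows, key):
    """Mean price of the rows grouped by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row["price"])
    return {group: mean(prices) for group, prices in groups.items()}


# Made-up rows for illustration only; these are NOT values from the dataset.
# Fair cuts are mostly large stones, Ideal cuts are mostly small stones.
toy_rows = (
    [{"cut": "Fair", "carat": 2.0, "price": 10000}] * 3
    + [{"cut": "Fair", "carat": 0.5, "price": 1000}]
    + [{"cut": "Ideal", "carat": 2.0, "price": 12000}]
    + [{"cut": "Ideal", "carat": 0.5, "price": 1500}] * 3
)

# Pooled over all carats, Fair looks more expensive than Ideal.
overall = mean_price_by(toy_rows, lambda r: r["cut"])
# Within each carat value, Ideal is more expensive than Fair.
within = mean_price_by(toy_rows, lambda r: (r["carat"], r["cut"]))

print("overall:", overall)
print("within carat:", within)
```

Grouping by both carat and cut is the simplest way to hold carat constant; on the real data, carat would need to be binned first because it takes many distinct values.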



Team Zeta
In general, higher-carat diamonds are more expensive than lower-carat diamonds.

For a given carat value, higher quality cuts are priced higher than lower quality cuts.

The higher the quality of a diamond’s cut, the fewer carats it needs to fetch a high price.

Most fair cut diamonds are also lower carat, resulting in a lower price.

Each cut demonstrates a logarithmic relationship between carat and price (this is a trend, not a rule).

When broken down by carat ranges (1-1.5, 1.5-2, 2-2.5, 2.5-3), carat values fall near a whole or half carat (x.0 and x.5, respectively) more often than between those values. Diamonds are naturally formed, so we would expect this distribution to be more random, but we learned that humans are involved in carat assignment, making it likely that people cutting diamonds are intentionally optimizing for “friendly” carat values.

The number of diamonds of each cut increases with the cut’s quality; there are few Fair cut diamonds in the set compared to the number of Ideal cut diamonds.

The dataset appears to have an artificial bound at the $20,000 mark, especially given that there are no fair cut entries above 4 carats, which is not aligned with the overall trends of the dataset. This artificial bound is likely caused by a combination of the aforementioned factors reducing the value of data points above this mark.



Zach
Observations


 * Price and carat exhibit a linear trend on a log-log scale; this implies a power-law relation between the two variables. (Fig. Z1)
 * Cut modulates the price; better cuts tend to result in higher prices. However, this effect seems to saturate at the highest cuts.
 * At extreme low values of carat, fair diamonds seem to be higher in price than other cuts. However, the wide confidence bounds on the smoothed trend indicate the trend line is less trustworthy in this region.
 * Furthermore, a regression is generally less trustworthy at the edges of the data, as there tend to be fewer data in those extreme regions.
 * The lowest-quality cut diamonds extend to higher carat than other cuts.
 * The smoothed trend implies that the highest-quality diamonds experience a price decrease at the highest carats; however, the confidence bounds cast doubt on this conclusion.
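The link between a straight line on log-log axes and a power law follows from taking logarithms: if price = a * carat^b, then log(price) = log(a) + b * log(carat), which is linear in log(carat) with slope b. The sketch below fits that line by ordinary least squares. It is an illustration under stated assumptions, not the method behind Fig. Z1: the function name `fit_power_law` is hypothetical, and the points are synthetic, generated from arbitrarily chosen constants (a = 4000, b = 1.7) that are not estimates from the dataset.

```python
import math


def fit_power_law(carats, prices):
    """Least-squares fit of log(price) = log(a) + b * log(carat); returns (a, b)."""
    xs = [math.log(c) for c in carats]
    ys = [math.log(p) for p in prices]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope on the log-log scale = exponent
    a = math.exp(y_bar - b * x_bar)     # intercept on the log scale = log(a)
    return a, b


# Synthetic points from price = 4000 * carat**1.7. The constants are arbitrary
# and chosen only so the fit can be checked; they are NOT dataset estimates.
toy_carats = [0.3, 0.5, 1.0, 1.5, 2.0]
toy_prices = [4000 * c ** 1.7 for c in toy_carats]

a, b = fit_power_law(toy_carats, toy_prices)
print(f"fitted prefactor a = {a:.1f}, exponent b = {b:.2f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating constants; real data would scatter around the line, and the fit says nothing about the poorly sampled edges noted above.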




 * Carat values tend to land at "special" fractional values. (Fig. Z2)
 * Values tend to land at or just above these special values rather than below them; the peaks are asymmetric about the "special" fractional values.
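One way to quantify the asymmetry is to count how many diamonds fall just below a special value and how many fall at or just above it, using windows of equal width. This is a minimal sketch, not the analysis behind Fig. Z2: the function name `counts_around`, the 0.05-carat window, and the toy carat list are assumptions, and the code assumes carat is recorded to two decimal places.

```python
def counts_around(carats, special, window=0.05):
    """Count carats in [special - window, special) and in [special, special + window).

    Works in integer hundredths of a carat so that float rounding does not
    push a value across the boundary.
    """
    target = round(special * 100)
    width = round(window * 100)
    hundredths = [round(c * 100) for c in carats]
    below = sum(1 for h in hundredths if target - width <= h < target)
    at_or_above = sum(1 for h in hundredths if target <= h < target + width)
    return below, at_or_above


# Made-up carat values for illustration only; NOT values from the dataset.
toy_carats = [0.95, 0.99, 1.00, 1.00, 1.01, 1.01, 1.02, 1.04, 1.20]

below, at_or_above = counts_around(toy_carats, 1.0)
print(f"just below 1.0 carat: {below}, at or just above: {at_or_above}")
```

A much larger count at or above the special value than below it would match the asymmetric peaks described above.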